Global Economic Efficiency and Cost of Living Analysis 2024: A Comprehensive Study Using Multivariate and Geospatial Analytics
Introduction
As economies grow more interdependent, comparing the cost of living across countries matters to individuals, businesses, and policymakers alike. The "Global Economic Efficiency and Cost of Living Analysis 2024" project examines the relative affordability and economic efficiency of living in different regions worldwide. This analysis leverages multiple data sources, including the Cost of Living Index data from Numbeo, GDP and GNI data from the World Bank, and additional economic indicators from the International Monetary Fund and United Nations. The dataset offers indices such as the Cost of Living Index, Rent Index, Groceries Index, and Local Purchasing Power Index, each benchmarked against New York City (NYC) as the standard. These indices describe the cost of consumer goods, housing, groceries, and dining, and the purchasing power of residents in various countries. Using multivariate analysis, clustering, regression modeling, and geospatial analysis, the project aims to uncover global variations in cost of living and economic efficiency, and the impact of global economic trends on both.
The project is also designed as an educational resource for students, particularly those studying economics, data science, and international relations. By working through real-world data, students gain hands-on experience with analytical tools and methodologies and a better understanding of global economic dynamics. Foundational economic theories, such as Marshall's (1890) analysis of cost and utility and Keynes's (1936) focus on purchasing power, provide a theoretical framework that students can apply to their own studies. Drawing on these theories, the project contributes to academic discourse and equips students with practical skills in data analysis and economic interpretation, preparing them for future careers in a globally connected economy.
Background and Analysis
Cost of living is a pivotal factor in determining economic well-being, influencing both individual quality of life and broader economic health across regions. In today's globally mobile society, individuals, businesses, and policymakers are increasingly focused on how different locations compare regarding living expenses and economic efficiency. This project integrates data from several authoritative sources to conduct a robust analysis of these factors, providing a direct comparison with New York City—a global financial hub known for its high living costs. Utilizing advanced analytical methodologies, this project will explore the complex interplay between cost of living and economic efficiency across various regions. The analysis will encompass a wide range of techniques, including multivariate statistical analysis, clustering methods, regression models, and geospatial analysis. These methods will uncover the underlying patterns and drivers of cost of living disparities and assess the efficiency with which different economies manage living costs relative to income levels. Additionally, the project will consider the implications of global economic trends, such as inflation, trade agreements, and economic recessions, on the cost of living and economic efficiency. By providing a comprehensive examination of these factors, this project aims to offer valuable insights for policymakers, businesses, and individuals navigating the complex landscape of global living costs and economic conditions.
Table of Contents
- Introduction
- Background and Analysis
- 1. Data Overview
- 2. Data Import and Preparation
- 3. Exploratory Data Analysis (EDA)
- 4. Correlation Analysis
- 5. Cluster Analysis
- 6. Principal Component Analysis (PCA)
- 7. Regression Analysis
- 8. Geospatial Analysis
- 9. Machine Learning
- 10. Hypothesis Testing
- 11. Economic Efficiency Analysis
- 12. Scenario Analysis and Simulations
- 13. Conclusion
- References
- Data Sources
1. Data Overview
The dataset utilized in this project, provided by Numbeo, consists of key indices reflecting the cost of living and purchasing power across various countries, benchmarked against New York City (NYC). It includes data on the cost of living (excluding rent), rent prices, combined cost of living plus rent, grocery costs, restaurant prices, and local purchasing power. Each index is calculated using standardized methods based on user-contributed data, ensuring consistent comparisons across countries. This dataset serves as the foundation for the advanced analytics performed in this study, offering a comprehensive view of global economic conditions in 2024.
2. Data Import and Preparation
# Basic Libraries
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For plotting and visualizations
import seaborn as sns # For statistical data visualization
# Machine Learning and Statistical Analysis
from sklearn.decomposition import PCA # For Principal Component Analysis
from sklearn.cluster import KMeans # For K-means clustering
from sklearn.linear_model import LinearRegression # For regression analysis
from sklearn.model_selection import train_test_split # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler # For feature scaling
from sklearn.metrics import mean_squared_error, r2_score # For evaluating regression models
from sklearn.ensemble import RandomForestRegressor # For random forest regression
from sklearn.svm import SVR # For Support Vector Regression
from scipy.stats import pearsonr, ttest_ind, f_oneway # For statistical tests
# Geospatial Libraries
import geopandas as gpd # For geospatial data manipulation
import folium # For interactive maps
# Advanced Data Visualization
import plotly.express as px # For advanced interactive visualizations
# Machine Learning Interpretation
import shap # For model interpretation using SHAP values
# Others
import warnings
warnings.filterwarnings('ignore') # To suppress warnings for cleaner outputs
The libraries imported above serve various purposes essential for the analysis in this project. Pandas and NumPy are foundational for data manipulation and numerical operations. Matplotlib and Seaborn enable the creation of both basic and statistical visualizations. Scikit-learn provides tools for machine learning tasks such as PCA, clustering, and regression, while SciPy supports statistical testing. GeoPandas and Folium are used for geospatial data handling and interactive mapping. Plotly Express offers interactive visualizations, and SHAP is employed to interpret machine learning models, particularly in understanding feature importance. Together, these libraries cover every analysis step applied to the dataset.
# Load the dataset
file_path = r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the dataset to understand its structure
data.head()
# Check for any missing values in the dataset
data.isnull().sum()
# Display summary statistics to get an overview of the data
data.describe()
| | Rank | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index |
|---|---|---|---|---|---|---|---|
| count | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 | 121.000000 |
| mean | 61.000000 | 43.555372 | 16.052893 | 30.357851 | 44.228926 | 36.471074 | 65.094215 |
| std | 35.073732 | 16.147574 | 11.412267 | 13.263721 | 17.055109 | 18.258110 | 39.569094 |
| min | 1.000000 | 18.800000 | 2.400000 | 11.100000 | 17.500000 | 12.800000 | 2.300000 |
| 25% | 31.000000 | 30.200000 | 8.500000 | 19.800000 | 31.600000 | 21.600000 | 34.800000 |
| 50% | 61.000000 | 39.500000 | 12.400000 | 27.000000 | 40.500000 | 33.100000 | 50.600000 |
| 75% | 91.000000 | 52.800000 | 20.100000 | 37.000000 | 53.700000 | 47.200000 | 99.400000 |
| max | 121.000000 | 101.100000 | 67.200000 | 74.900000 | 109.100000 | 97.000000 | 182.500000 |
Here, the dataset is imported using Pandas, followed by a quick inspection of the first few rows to confirm the data's structure and content. The next step involves checking for any missing values, which is critical for maintaining the integrity of the analysis. Summary statistics are also generated to provide insights into the central tendencies, variability, and overall distribution of the numerical data. These preparatory actions ensure that the data is properly organized and ready for the subsequent advanced analyses.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a critical step in understanding the underlying patterns, relationships, and characteristics of the dataset. This process involves visualizing and summarizing the data to identify trends, correlations, and potential outliers that might influence the outcomes of more advanced analyses. EDA helps in forming hypotheses and guiding the direction of further analysis.
# Set up the matplotlib figure
plt.figure(figsize=(18, 12))
# Plot distribution for each index
indices = ['Cost of Living Index', 'Rent Index', 'Cost of Living Plus Rent Index', 'Groceries Index', 'Restaurant Price Index', 'Local Purchasing Power Index']
for i, index in enumerate(indices):
    plt.subplot(2, 3, i+1)
    sns.histplot(data[index], kde=True, bins=30)
    plt.title(f'Distribution of {index}')
    plt.xlabel(index)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
The distribution plots of the Cost of Living Index, Rent Index, Groceries Index, and Local Purchasing Power Index reveal patterns of skewness, with most indices showing a right-skewed distribution. This indicates that a majority of countries have lower index values, with a few countries exhibiting higher costs or purchasing power. These visualizations are essential for identifying the overall spread and any anomalies in the data, providing a foundational understanding for further analysis.
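The visual impression of right skew can also be checked numerically. The following is a minimal sketch, assuming a synthetic log-normal sample in place of the Numbeo columns; the `sample` Series and its parameters are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for an index column (illustrative only).
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=3.5, sigma=0.5, size=500), name="Toy Index")

# Sample skewness: positive values indicate a longer right tail.
skewness = sample.skew()

# In a right-skewed distribution the mean is pulled above the median.
mean_above_median = bool(sample.mean() > sample.median())

print(f"skewness = {skewness:.2f}, mean > median: {mean_above_median}")
```

On the real data, the same check would be `data[indices].skew()`, which returns one skewness value per index.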
# Pair plot to show relationships between indices
sns.pairplot(data[indices])
plt.suptitle('Pairwise Relationships Between Indices', y=1.02)
plt.show()
The pair plot above reveals the relationships between different indices in the dataset. Strong linear correlations are evident between several pairs, such as the Cost of Living Index, Rent Index, and the Cost of Living Plus Rent Index, indicating that as one increases, the others tend to increase as well. These relationships suggest that countries with higher living costs tend to have higher rent prices and overall expenses when combining rent with other living costs. Identifying these correlations is crucial for understanding how different economic factors are interconnected, which will guide the more detailed analyses that follow.
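import plotly.express as px
# Compute the Pearson correlation matrix of the indices (corr_matrix was not defined earlier)
corr_matrix = data[indices].corr()
# Create the interactive correlation matrix with light blue and light red colors
fig = px.imshow(corr_matrix,
                text_auto=True,
                color_continuous_scale=['#87CEFA', '#FFA07A'],  # Light blue and light red colors
                aspect='auto')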
# Update the layout to center the title
fig.update_layout(
    title={
        'text': "Interactive Correlation Matrix of Indices",
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    title_font=dict(size=20),
    coloraxis_colorbar=dict(title="Correlation"),
)
# Show the figure
fig.show()
The interactive correlation matrix above provides a detailed view of the relationships between different indices in the dataset. The strong positive correlations, particularly between the Cost of Living Index, Rent Index, and Cost of Living Plus Rent Index, suggest that these factors are closely interrelated; as the cost of living increases, rent and overall living expenses tend to rise accordingly. Interestingly, the Local Purchasing Power Index shows a moderate correlation with other indices, indicating that while higher costs often accompany higher purchasing power, the relationship is not as strong as with other factors. This visualization not only highlights the interconnectedness of these economic indicators but also helps to identify key areas where cost factors are most closely aligned.
Summary of Findings from Exploratory Data Analysis (EDA):
The exploratory data analysis reveals several key insights about the dataset. First, the distribution of indices such as the Cost of Living Index, Rent Index, and Local Purchasing Power Index shows a right-skewed pattern, indicating that most countries have lower values, with a few outliers displaying significantly higher costs and purchasing power. The pairwise relationships highlight strong correlations between indices like the Cost of Living, Rent, and Groceries indices, suggesting that these factors are tightly linked—countries with higher costs of living tend to also have higher rent and grocery prices. Finally, the correlation matrix underscores these relationships, with particularly strong positive correlations between the Cost of Living, Rent, and Cost of Living Plus Rent indices. This analysis suggests that in countries where living costs are high, rent and overall expenses are also elevated, and while purchasing power increases with higher costs, the relationship is not as pronounced. These insights lay the foundation for deeper analysis into the factors driving these economic conditions across different countries.
4. Correlation Analysis
In the previous section, we explored the general relationships between variables using visualizations such as pair plots and a correlation matrix. In this section, we will perform a more detailed correlation analysis to quantify and interpret the strength and direction of relationships between key indices.
# Calculate Pearson correlation coefficients
correlation_matrix = data[indices].corr(method='pearson')
# Display the correlation matrix
correlation_matrix
| | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index |
|---|---|---|---|---|---|---|
| Cost of Living Index | 1.000000 | 0.820885 | 0.971780 | 0.958452 | 0.945483 | 0.692688 |
| Rent Index | 0.820885 | 1.000000 | 0.932425 | 0.770944 | 0.763537 | 0.683912 |
| Cost of Living Plus Rent Index | 0.971780 | 0.932425 | 1.000000 | 0.924935 | 0.913618 | 0.720701 |
| Groceries Index | 0.958452 | 0.770944 | 0.924935 | 1.000000 | 0.855057 | 0.640634 |
| Restaurant Price Index | 0.945483 | 0.763537 | 0.913618 | 0.855057 | 1.000000 | 0.673539 |
| Local Purchasing Power Index | 0.692688 | 0.683912 | 0.720701 | 0.640634 | 0.673539 | 1.000000 |
The table above displays the Pearson correlation coefficients between the key indices in the dataset. The results show strong positive correlations between several indices, particularly the Cost of Living Index, Rent Index, and Cost of Living Plus Rent Index. For example, the Cost of Living Index is highly correlated with the Rent Index (0.82) and the Groceries Index (0.96), indicating that countries with higher overall living costs tend to have higher rent and grocery prices as well. Similarly, the Cost of Living Plus Rent Index shows a strong correlation with both the Groceries Index (0.92) and the Restaurant Price Index (0.91). These findings suggest that various components of living costs are closely interlinked, reinforcing the idea that high costs in one area are often accompanied by high costs in others. Understanding these relationships is crucial for interpreting the economic conditions across different countries and will inform the subsequent analyses in this project.
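For readers who want to see what a Pearson coefficient measures, the sketch below computes it from its definition (covariance divided by the product of the standard deviations) and compares the result with NumPy. The `cost` and `rent` values are made-up toy numbers, not rows from the dataset, and `pearson_r` is a hypothetical helper written for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Toy values (illustrative, not taken from the dataset).
cost = [30.0, 45.0, 60.0, 80.0, 100.0]
rent = [8.0, 15.0, 18.0, 35.0, 60.0]

r_manual = pearson_r(cost, rent)
r_numpy = float(np.corrcoef(cost, rent)[0, 1])
print(round(r_manual, 4), round(r_numpy, 4))
```

Both routes should agree to floating-point precision, which is the same quantity `data[indices].corr(method='pearson')` reports for each pair of columns.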
from scipy.stats import pearsonr
# Test the significance of the correlations
p_values = data[indices].corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(data[indices].columns))
# Display the p-values
p_values
| | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index |
|---|---|---|---|---|---|---|
| Cost of Living Index | 0.000000e+00 | 9.894907e-31 | 1.687787e-76 | 1.132141e-66 | 8.094302e-60 | 1.349649e-18 |
| Rent Index | 9.894907e-31 | 0.000000e+00 | 1.943347e-54 | 4.555238e-25 | 2.380434e-24 | 5.356191e-18 |
| Cost of Living Plus Rent Index | 1.687787e-76 | 1.943347e-54 | 0.000000e+00 | 8.086503e-52 | 2.451265e-48 | 1.169151e-20 |
| Groceries Index | 1.132141e-66 | 4.555238e-25 | 8.086503e-52 | 0.000000e+00 | 9.739186e-36 | 2.507723e-15 |
| Restaurant Price Index | 8.094302e-60 | 2.380434e-24 | 2.451265e-48 | 9.739186e-36 | 0.000000e+00 | 2.570035e-17 |
| Local Purchasing Power Index | 1.349649e-18 | 5.356191e-18 | 1.169151e-20 | 2.507723e-15 | 2.570035e-17 | 0.000000e+00 |
The table above presents the p-values associated with the Pearson correlation coefficients between the indices. The zeros on the diagonal come from subtracting the identity matrix in the code and carry no information. Every off-diagonal p-value is far below 0.05 (the largest is about 2.5e-15), so all pairwise correlations are statistically significant. This implies that the relationships observed, such as the strong correlations between the Cost of Living Index, Rent Index, and Groceries Index, are unlikely to be due to random chance. Significance alone does not measure the strength of a relationship, but together with the large coefficients reported earlier it supports building the further analysis on these relationships.
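One way to avoid misleading diagonal entries is to build the p-value matrix pair by pair and leave the diagonal as missing. This is a minimal sketch on synthetic data; `pearson_pvalues` and the `toy` DataFrame are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pearson_pvalues(df):
    """Return a symmetric DataFrame of Pearson p-values with NaN on the diagonal."""
    cols = df.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            p = pearsonr(df[a], df[b])[1]  # second element is the p-value
            out.loc[a, b] = p
            out.loc[b, a] = p
    return out

# Synthetic data: "b" depends on "a"; "c" is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

p_matrix = pearson_pvalues(toy)
print(p_matrix)
```

With NaN on the diagonal, summaries such as `p_matrix.max().max()` describe only the actual pairwise tests.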
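# Example interpretation (to be expanded based on actual results)
# Use only the off-diagonal p-values: the diagonal is zero by construction
# and would make the check trivially true
off_diagonal = p_values.values[~np.eye(len(p_values), dtype=bool)]
if (off_diagonal < 0.05).any():
    print("Several correlations between the indices are statistically significant, indicating robust relationships.")
else:
    print("The correlations observed may not be statistically significant, suggesting weaker or more complex relationships between the indices.")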
Several correlations between the indices are statistically significant, indicating robust relationships.
The strong positive correlations observed, particularly among the Cost of Living Index, Rent Index, and Groceries Index, suggest that these factors are closely interconnected across countries. Their statistical significance indicates that the observed relationships are unlikely to be due to random variation and instead reflect underlying economic patterns. This provides a foundation for the more complex analyses that follow, such as clustering and regression modeling, where these relationships are explored further.
5. Cluster Analysis
Cluster analysis is a powerful technique used to group countries based on similarities in their economic indices. By identifying clusters of countries with similar cost of living, rent, and purchasing power, we can gain insights into regional economic patterns and potentially discover groups of countries that share similar economic characteristics.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Define the features for clustering
features = data[['Cost of Living Index', 'Rent Index', 'Cost of Living Plus Rent Index', 'Groceries Index', 'Restaurant Price Index', 'Local Purchasing Power Index']]
# Determine the optimal number of clusters using the Elbow Method
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(features)
    sse.append(kmeans.inertia_)
# Plot the results of the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method For Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Distances')
plt.show()
The Elbow Method plot above helps determine the optimal number of clusters by visualizing the sum of squared distances (inertia) against the number of clusters. In this plot, there is a noticeable "elbow" at 3 clusters, indicating that adding more clusters beyond this point does not significantly reduce the inertia. Therefore, the optimal number of clusters for this dataset appears to be 3. This finding suggests that the dataset can be effectively segmented into 3 distinct groups of countries based on their economic indices, which will be explored further in the subsequent steps.
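Reading an elbow by eye is somewhat subjective, so a second criterion can serve as a cross-check. The sketch below uses the silhouette score, a different technique from the Elbow Method used in this analysis, on synthetic data from `make_blobs`; the data, the candidate range of k, and `n_init=10` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for each candidate k; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Applied to the real `features`, agreement between the silhouette maximum and the elbow would strengthen the choice of three clusters; disagreement would suggest looking at both solutions.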
# Apply K-Means with the optimal number of clusters
optimal_clusters = 3 # Replace with the optimal number found from the Elbow Method
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
data['Cluster'] = kmeans.fit_predict(features)
# Visualize the clusters with a custom color palette
plt.figure(figsize=(12, 8))
custom_palette = ['purple', 'teal', 'blue'] # Replacing yellow with blue
sns.scatterplot(x='Cost of Living Index', y='Local Purchasing Power Index', hue='Cluster', data=data, palette=custom_palette)
plt.title('Clusters of Countries Based on Economic Indices')
plt.show()
# Group countries by their assigned cluster
clustered_countries = data[['Country', 'Cluster']].groupby('Cluster')['Country'].apply(list)
# Display the countries in each cluster
for cluster, countries in clustered_countries.items():
    print(f"Cluster {cluster}: {', '.join(countries)}")
Cluster 0: Barbados, Italy, Cyprus, Uruguay, Jamaica, Malta, Trinidad And Tobago, Costa Rica, Greece, Estonia, Slovenia, Latvia, Spain, Lithuania, Slovakia, Czech Republic, Panama, Croatia, Taiwan, Portugal, Hungary, Poland, Montenegro, Bulgaria, Romania, Fiji, South Africa, China, Malaysia, India
Cluster 1: Cuba, Albania, Lebanon, Palestine, Jordan, Armenia, Mexico, El Salvador, Chile, Guatemala, Venezuela, Dominican Republic, Serbia, Turkey, Cambodia, Cameroon, Zimbabwe, Mauritius, Bosnia And Herzegovina, Sri Lanka, Thailand, Moldova, Georgia, North Macedonia, Ecuador, Kazakhstan, Nigeria, Azerbaijan, Philippines, Russia, Ghana, Brazil, Kenya, Botswana, Peru, Morocco, Kosovo (Disputed Territory), Argentina, Iraq, Uganda, Algeria, Colombia, Vietnam, Tunisia, Bolivia, Kyrgyzstan, Indonesia, Iran, Uzbekistan, Belarus, Ukraine, Nepal, Paraguay, Madagascar, Syria, Tanzania, Bangladesh, Egypt, Libya, Pakistan
Cluster 2: Switzerland, Bahamas, Iceland, Singapore, Norway, Denmark, Hong Kong (China), United States, Australia, Austria, Canada, New Zealand, Ireland, France, Puerto Rico, Finland, Netherlands, Israel, Luxembourg, Germany, United Kingdom, Belgium, South Korea, Sweden, United Arab Emirates, Bahrain, Qatar, Japan, Saudi Arabia, Oman, Kuwait
The scatter plot visualizes the results of K-Means clustering, grouping countries into three clusters based on their economic indices. K-Means cluster numbers are arbitrary labels, so the descriptions here follow the membership printed above. Cluster 1 (teal) includes countries like Cuba, Albania, and Lebanon and is characterized by generally lower costs of living. Cluster 2 (blue) features countries such as Switzerland, Iceland, Singapore, and Japan, where the cost of living and purchasing power are generally highest. Cluster 0 (purple) includes countries like Barbados, Italy, and Cyprus and forms an intermediate group between the other two. These characterizations should be checked against per-cluster averages of the indices. The segmentation highlights how countries with similar economic profiles tend to cluster together, providing insights into global economic patterns and regional similarities.
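Because K-Means assigns cluster numbers arbitrarily, a per-cluster profile is a convenient way to decide which label means "low", "middle", or "high". The sketch below shows the idea on a toy DataFrame; the six rows, their values, and their cluster labels are invented for illustration.

```python
import pandas as pd

# Toy rows standing in for the clustered dataset (values are illustrative).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Cost of Living Index": [25.0, 30.0, 45.0, 50.0, 80.0, 90.0],
    "Local Purchasing Power Index": [20.0, 30.0, 60.0, 70.0, 120.0, 130.0],
    "Cluster": [1, 1, 0, 0, 2, 2],
})

# Mean of each index per cluster, ordered from cheapest to most expensive.
profile = (
    toy.groupby("Cluster")[["Cost of Living Index", "Local Purchasing Power Index"]]
    .mean()
    .sort_values("Cost of Living Index")
)
print(profile)
```

On the real data, `data.groupby('Cluster')[indices].mean()` would produce the same kind of table for all six indices.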
Summary of Cluster Analysis
The cluster analysis applied K-Means clustering to categorize countries based on their economic indices. The Elbow Method, illustrated in the first plot, identified three clusters as a suitable segmentation: beyond this point the sum of squared distances decreases only slowly. The second plot visualizes these three clusters using the Cost of Living Index and Local Purchasing Power Index. Following the printed cluster membership, Cluster 1 (teal) comprises countries with generally lower costs of living, Cluster 0 (purple) forms an intermediate group, and Cluster 2 (blue) consists of countries with the highest costs of living and purchasing power. This clustering groups countries with similar economic profiles, uncovering patterns that may be linked to geographic or regional economic similarities. These insights will guide the further analysis.
6. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that helps to simplify the complexity of high-dimensional data while retaining its most important features. By transforming the original features into a new set of orthogonal components, PCA can reveal underlying structures in the data that are not immediately apparent.
6.1 Applying PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Standardize the features before applying PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Apply PCA
pca = PCA(n_components=2) # We start with 2 components for visualization purposes
principal_components = pca.fit_transform(scaled_features)
# Add the principal components to the dataframe
data['PC1'] = principal_components[:, 0]
data['PC2'] = principal_components[:, 1]
# Visualize the principal components
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=data, palette='viridis')
plt.title('PCA of Economic Indices')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]*100:.2f}% Variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]*100:.2f}% Variance)')
plt.show()
The scatter plot above visualizes the results of applying Principal Component Analysis (PCA) to the economic indices dataset. The PCA reduces the dimensionality of the data while retaining the most important features, allowing us to observe the distribution of the clusters in a two-dimensional space. The first principal component (PC1), which accounts for 85.24% of the variance, and the second principal component (PC2), accounting for 7.51% of the variance, are plotted on the x and y axes, respectively. The clustering is still evident in this reduced space, with Cluster 0 (Purple), Cluster 1 (Teal), and Cluster 2 (Yellow) separating based on the combined influence of the original indices. This visualization provides a clearer understanding of how these economic indices contribute to the overall variance in the dataset and the distinctions between clusters.
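The variance percentages on the axes can be related back to the correlation matrix: for standardized features, each component's share of variance is the corresponding eigenvalue of the correlation matrix divided by the sum of all eigenvalues. The following sketch checks this on synthetic data (the array `X` is generated for illustration and is not the Numbeo data).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features (illustrative; not the Numbeo indices).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(200, 1)),
    base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Route 1: scikit-learn PCA on standardized features.
ratio_sklearn = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

# Route 2: eigenvalues of the correlation matrix, largest first, as shares of their sum.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
ratio_manual = eigenvalues / eigenvalues.sum()

print(np.round(ratio_sklearn, 4), np.round(ratio_manual, 4))
```

This also explains why one dominant component is expected here: when all indices are strongly correlated with one another, the first eigenvalue of the correlation matrix is large.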
# Get the loadings (weights of each original feature in the principal components)
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=features.columns)
# Display the loadings
loadings
| | PC1 | PC2 |
|---|---|---|
| Cost of Living Index | 0.432645 | -0.228391 |
| Rent Index | 0.397578 | 0.092917 |
| Cost of Living Plus Rent Index | 0.437963 | -0.106309 |
| Groceries Index | 0.413752 | -0.299175 |
| Restaurant Price Index | 0.413335 | -0.191430 |
| Local Purchasing Power Index | 0.347708 | 0.895405 |
The table above displays the loadings of each original economic index on the first two principal components (PC1 and PC2). These loadings indicate how much each index contributes to the respective principal components.
PC1 (which explains 85.24% of the variance) is strongly influenced by all indices, with the Cost of Living Index (0.43), Cost of Living Plus Rent Index (0.44), and Groceries Index (0.41) having the highest loadings. This suggests that PC1 primarily captures the overall cost of living and its associated factors across countries.
PC2 (which explains 7.51% of the variance) is notably influenced by the Local Purchasing Power Index (0.90), indicating that this component differentiates countries based on their purchasing power relative to living costs.
These results highlight the key drivers of variance in the dataset: while PC1 reflects a combination of various cost indices, PC2 distinctly captures differences in purchasing power. This analysis provides deeper insight into the economic characteristics that differentiate the clusters, offering a clearer understanding of how these factors interact across countries.
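The values in `pca.components_` are eigenvector weights rather than correlations. If correlations between each feature and each component are wanted, the weights can be rescaled by the component's standard deviation and the feature's standard deviation. The sketch below demonstrates this on synthetic data and verifies one entry directly; all data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(150, 1))
X = np.hstack([base + 0.4 * rng.normal(size=(150, 1)) for _ in range(3)])

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)

# Eigenvector weights as returned by scikit-learn (rows: features, columns: PCs).
weights = pca.components_.T

# Rescale to feature-component correlations:
# weight * sqrt(component variance) / feature standard deviation.
correlations = weights * np.sqrt(pca.explained_variance_) / Z.std(axis=0, ddof=1)[:, None]

# Direct check for feature 0 and PC1.
direct = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(round(float(correlations[0, 0]), 4), round(float(direct), 4))
```

Either form supports the same qualitative reading; the rescaled version simply puts every entry on the familiar -1 to 1 correlation scale.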
import matplotlib.pyplot as plt
import numpy as np
# Fit a PCA with all components for the scree plot
# (the 2-component model above only reports the variance of PC1 and PC2)
pca_full = PCA().fit(scaled_features)
# Explained variance
explained_variance = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
# Scree plot with cumulative variance
plt.figure(figsize=(10, 6))
bars = plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, label='Individual Explained Variance')
# Adding labels to each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.01, f'{yval:.2%}', ha='center', va='bottom')
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', color='red', label='Cumulative Explained Variance')
plt.title('Scree Plot with Cumulative Variance')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.legend(loc='best')
plt.grid(True)
plt.show()
The scree plot above illustrates both the individual explained variance and the cumulative explained variance for the principal components. The first principal component (PC1) captures a significant portion of the variance in the dataset, accounting for over 85% of the total variance. The second component (PC2) explains an additional 7.51% of the variance, and the third component (PC3) contributes 4.46%, as shown by the blue bars representing the individual explained variance.
The red line indicates the cumulative explained variance, showing that the first two principal components together capture over 92% of the total variance (85.24% + 7.51% = 92.75%), and with the third component this rises to about 97% (97.21%). This confirms that most of the information in the dataset is retained by the first few components, justifying the use of PCA for dimensionality reduction. Knowing how much variance each component captures helps in assessing how well PCA simplifies the dataset while retaining its essential characteristics.
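The cumulative figures can be reproduced with a few lines of arithmetic, which also gives a simple rule for choosing how many components to keep. The ratios below are the ones reported in the text for PC1 to PC3; the 90% `target` is an illustrative threshold, not a value used in the analysis.

```python
import numpy as np

# Explained-variance ratios for PC1 to PC3 as reported in the text.
ratios = np.array([0.8524, 0.0751, 0.0446])
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the target.
target = 0.90  # illustrative threshold, not a value from the analysis
n_components = int(np.argmax(cumulative >= target)) + 1

print(cumulative.round(4), n_components)
```

With these ratios, two components are enough to pass a 90% threshold, which is consistent with the two-dimensional PCA plot used above.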
Summary of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) reduced the dimensionality of the economic indices while retaining most of the variance in the dataset. Applying PCA (6.1) showed that the first two principal components (PC1 and PC2) explain over 92% of the total variance. Interpreting the results (6.2) showed that PC1 is driven by the cost indices collectively, while PC2 is dominated by the Local Purchasing Power Index, indicating distinct economic patterns across countries (Jolliffe, 2002). Visualizing the explained variance (6.3) confirmed that these two components alone capture the majority of the variance, which justifies the dimensionality reduction. The PCA thus simplifies the dataset and makes the relationships among the indices easier to interpret (James et al., 2013).
7. Regression Analysis
In this step, we'll explore the relationships between the economic indices and use regression models to understand and predict outcomes based on these indices. Regression analysis will help us quantify the impact of different variables on a target variable and identify significant predictors.
import statsmodels.api as sm
# Define the predictor (Cost of Living Index) and target (Local Purchasing Power Index)
X = data['Cost of Living Index']
y = data['Local Purchasing Power Index']
# Add a constant to the predictor (required for statsmodels)
X = sm.add_constant(X)
# Fit the simple linear regression model
model = sm.OLS(y, X).fit()
# Print the model summary
print(model.summary())
OLS Regression Results
========================================================================================
Dep. Variable: Local Purchasing Power Index R-squared: 0.480
Model: OLS Adj. R-squared: 0.475
Method: Least Squares F-statistic: 109.8
Date: Wed, 21 Aug 2024 Prob (F-statistic): 1.35e-18
Time: 20:59:00 Log-Likelihood: -576.69
No. Observations: 121 AIC: 1157.
Df Residuals: 119 BIC: 1163.
Df Model: 1
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -8.8371 7.522 -1.175 0.242 -23.732 6.058
Cost of Living Index 1.6974 0.162 10.477 0.000 1.377 2.018
==============================================================================
Omnibus: 17.095 Durbin-Watson: 1.678
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32.162
Skew: 0.594 Prob(JB): 1.04e-07
Kurtosis: 5.229 Cond. No. 134.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The output above presents the results of the simple linear regression, with the Cost of Living Index as the predictor and the Local Purchasing Power Index as the target. The Cost of Living Index is a statistically significant predictor (p < 0.001) with a positive coefficient of 1.697: each one-unit increase in the Cost of Living Index is associated with an increase of approximately 1.697 units in the Local Purchasing Power Index. The intercept (-8.84) is not statistically significant (p = 0.242). The model's R-squared value is 0.480, indicating that about 48% of the variance in the Local Purchasing Power Index is explained by the Cost of Living Index. This is a moderately strong relationship, and the remaining 52% of the variance is left to factors outside the model. The F-statistic (109.8) confirms the overall significance of the model. The Omnibus and Jarque-Bera tests (both p < 0.001) indicate that the residuals depart from normality, so the reported standard errors should be read with some caution. Overall, higher living costs tend to be associated with greater purchasing power across countries; the regression describes an association and does not by itself establish that one causes the other.
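With a single predictor, the slope and R-squared reported by an OLS summary have simple closed forms: the slope is the covariance divided by the predictor's variance, and R-squared equals the squared Pearson correlation. The sketch below illustrates this on synthetic data; the variables `x` and `y` and the parameters used to generate them are hypothetical and are not the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: x plays the role of a cost index, y of a purchasing power index.
x = rng.uniform(20, 100, size=121)
y = -9.0 + 1.7 * x + rng.normal(scale=28.0, size=x.size)

# Closed-form OLS estimates for one predictor.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# R-squared computed from the residuals ...
residuals = y - (intercept + slope * x)
r_squared = 1.0 - residuals.var() / y.var()

# ... matches the squared Pearson correlation when there is a single predictor.
pearson_r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"R^2={r_squared:.3f}, r^2={pearson_r**2:.3f}")
```

This identity holds only for simple regression; with several predictors, R-squared is no longer the square of any single pairwise correlation.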
# Define the predictors (all economic indices) and target (Local Purchasing Power Index)
X = data[['Cost of Living Index', 'Rent Index', 'Cost of Living Plus Rent Index', 'Groceries Index', 'Restaurant Price Index']]
y = data['Local Purchasing Power Index']
# Add a constant to the predictors
X = sm.add_constant(X)
# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()
# Print the model summary
print(model.summary())
OLS Regression Results
========================================================================================
Dep. Variable: Local Purchasing Power Index R-squared: 0.527
Model: OLS Adj. R-squared: 0.506
Method: Least Squares F-statistic: 25.60
Date: Wed, 21 Aug 2024 Prob (F-statistic): 2.76e-17
Time: 21:02:26 Log-Likelihood: -570.97
No. Observations: 121 AIC: 1154.
Df Residuals: 115 BIC: 1171.
Df Model: 5
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const 6.5503 10.060 0.651 0.516 -13.377 26.478
Cost of Living Index 12.3589 36.920 0.335 0.738 -60.772 85.490
Rent Index 11.8705 34.217 0.347 0.729 -55.908 79.649
Cost of Living Plus Rent Index -22.1653 71.260 -0.311 0.756 -163.318 118.987
Groceries Index -0.2428 0.638 -0.380 0.704 -1.507 1.022
Restaurant Price Index 0.3652 0.519 0.704 0.483 -0.663 1.393
==============================================================================
Omnibus: 12.452 Durbin-Watson: 1.655
Prob(Omnibus): 0.002 Jarque-Bera (JB): 16.803
Skew: 0.553 Prob(JB): 0.000224
Kurtosis: 4.453 Cond. No. 2.98e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.98e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The output above presents the results of the multiple linear regression analysis, where multiple economic indices are used as predictors for the Local Purchasing Power Index. The model includes the Cost of Living Index, Rent Index, Cost of Living Plus Rent Index, Groceries Index, and Restaurant Price Index as predictors. The model’s R-squared value is 0.527, indicating that approximately 52.7% of the variance in the Local Purchasing Power Index is explained by these predictors. However, none of the individual predictors are statistically significant at the 0.05 level, as indicated by their high p-values. This suggests that while the combination of these variables explains a significant portion of the variance, no single variable stands out as a strong predictor when controlling for the others. The F-statistic of 25.60 with a p-value of 2.76e-17 confirms that the model as a whole is statistically significant, implying that the economic indices collectively have a significant relationship with the Local Purchasing Power Index. The condition number (2.98e+03) suggests potential multicollinearity issues, meaning that the predictors may be highly correlated with each other, which could affect the stability and interpretation of the regression coefficients. This will be explored further in the next step with regression diagnostics to assess the impact of multicollinearity and validate the model assumptions.
import seaborn as sns
# Residual plot to check for linearity and homoscedasticity
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True, line_kws={'color': 'red'})
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
# Variance Inflation Factor (VIF) to check for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
                          feature            VIF
0                           const      15.838952
1            Cost of Living Index   55159.466869
2                      Rent Index   23666.278859
3  Cost of Living Plus Rent Index  138648.894538
4                 Groceries Index      18.392194
5          Restaurant Price Index      13.929992
The residual plot above is used to validate the assumptions of the regression model. It displays the residuals (differences between observed and predicted values) against the fitted values. Ideally, the residuals should be randomly scattered around zero, indicating that the model assumptions of linearity and homoscedasticity are met (Montgomery et al., 2012). However, the red line in the plot suggests a non-linear pattern, indicating potential issues with the linearity assumption of the model.
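Additionally, the Variance Inflation Factor (VIF) values listed below the plot highlight concerns about multicollinearity. VIF values greater than 10 typically indicate high multicollinearity, which can distort the regression coefficients and make them less reliable (Kutner et al., 2005). Here, every predictor exceeds that threshold, and the values are extreme for the Cost of Living Index (about 55,000), the Rent Index (about 24,000), and the Cost of Living Plus Rent Index (about 139,000). This pattern is what one would expect when a composite index is included alongside its own components: the Cost of Living Plus Rent Index combines living costs and rent, so it is almost fully determined by the other two.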
These diagnostics suggest that the regression model may violate some key assumptions, particularly concerning linearity and multicollinearity. To improve the model, it may be necessary to consider alternative modeling approaches, such as regularization techniques (e.g., Ridge or Lasso regression) or removing highly correlated predictors to address multicollinearity (James et al., 2013).
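To illustrate why removing a highly correlated predictor helps, the sketch below computes VIF by hand (regressing each column on the others with an intercept) on synthetic data. The columns `cost`, `rent`, and `composite` are hypothetical stand-ins, and the `vif` helper is written for this example; it is not the statsmodels function used above.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF of each column, from regressing it on the other columns plus an intercept."""
    n_rows, n_cols = X.shape
    factors = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors

rng = np.random.default_rng(0)
cost = rng.uniform(20, 100, size=121)
rent = 0.4 * cost + rng.normal(scale=8.0, size=121)
# Hypothetical composite that is almost a weighted average of the two columns above.
composite = 0.55 * cost + 0.45 * rent + rng.normal(scale=0.05, size=121)

with_composite = vif(np.column_stack([cost, rent, composite]))
without_composite = vif(np.column_stack([cost, rent]))
print("VIF with composite:   ", np.round(with_composite, 1))
print("VIF without composite:", np.round(without_composite, 1))
```

In this toy setting, dropping the composite column brings the remaining VIF values back to single digits, which is the effect one would hope for when removing a redundant index from the regression.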
Conclusion of Regression Analysis¶
The regression analysis revealed that the Cost of Living Index alone is a significant predictor of the Local Purchasing Power Index, explaining 48% of its variance. Expanding the model to include the other indices raised R-squared only modestly (from 0.480 to 0.527, or from 0.475 to 0.506 after adjustment), and no individual coefficient remained statistically significant. High Variance Inflation Factor (VIF) values indicated that the predictors were highly correlated, which inflates the standard errors and complicates the interpretation of the coefficients. Furthermore, the residuals displayed non-linear patterns, suggesting that a linear model may not fully capture the relationships between these economic indices. This analysis underscores the need for alternative modeling approaches, such as Ridge or Lasso regression, or non-linear models, to better understand and predict purchasing power across countries, while also highlighting the challenges inherent in modeling interrelated economic variables.
Geospatial analysis helps visualize the geographical distribution of economic indices across different countries, allowing us to identify regional trends and differences in economic conditions.
import geopandas as gpd
import folium
import branca.colormap as cm
# Load the world shapefile from the downloaded data
world = gpd.read_file(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\ne_110m_admin_0_countries.shp')
# Merge the world map with the economic data based on country names
world = world.merge(data, how="left", left_on="NAME", right_on="Country")
# Create a folium map centered around a particular location
m = folium.Map(location=[0, 0], zoom_start=2)
# Define a color scale using branca colormap
colormap = cm.linear.PuBuGn_09.scale(world['Cost of Living Index'].min(), world['Cost of Living Index'].max())
colormap = colormap.to_step(index=[20, 40, 60, 80, 100])
# Add a choropleth layer to the map for the Cost of Living Index
folium.Choropleth(
    geo_data=world,
    name="choropleth",
    data=world,
    columns=["NAME", "Cost of Living Index"],
    key_on="feature.properties.NAME",
    fill_color='PuBuGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    nan_fill_color="black",  # Clearly indicate countries with missing data
    nan_fill_opacity=0.4
).add_to(m)
# Add a caption to the legend and attach it to the map once
colormap.caption = 'Cost of Living Index (relative to NYC = 100)'
colormap.add_to(m)
# Add additional layers as needed for other indices
folium.LayerControl().add_to(m)
# Display the map
m.save("cost_of_living_map_with_side_legend.html")
m
This map visualizes the Cost of Living Index for various countries in Asia, the Middle East, and Africa, relative to New York City (NYC = 100). Countries are shaded based on their cost of living, with darker shades indicating higher living costs and lighter shades representing lower living costs. Black shading indicates regions where data is unavailable. The labels display country names, which may appear in different languages or scripts based on the geographic region. This map offers a comparative view, helping to identify economic disparities in living costs across different countries within the highlighted regions.
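Countries can appear as "data unavailable" simply because their names are spelled differently in the shapefile and in the index table. One way to check this is an outer merge with pandas' `indicator=True` option, sketched below on two tiny hypothetical tables; the country lists and index values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables being joined (made-up values).
shapes = pd.DataFrame({"NAME": ["United States of America", "Vietnam", "France", "Chad"]})
index_data = pd.DataFrame({
    "Country": ["United States", "Vietnam", "France"],
    "Cost of Living Index": [70.0, 30.0, 60.0],
})

# indicator=True adds a "_merge" column recording where each row was found.
check = shapes.merge(
    index_data, how="outer", left_on="NAME", right_on="Country", indicator=True
)

# Map polygons with no data row: these are the ones drawn with the NaN fill.
no_data = check.loc[check["_merge"] == "left_only", "NAME"].tolist()
# Data rows with no polygon: often a spelling difference worth adding to a name mapping.
no_shape = check.loc[check["_merge"] == "right_only", "Country"].tolist()

print("Polygons without data:", no_data)
print("Data rows without a polygon:", no_shape)
```

Rows listed under "data rows without a polygon" are candidates for a name mapping; polygons without data that have no counterpart in the index table are genuinely missing.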
import geopandas as gpd
import pandas as pd
import folium
from folium.plugins import HeatMap
import branca.colormap as cm
# Step 1: Load the Shapefile and CSV File
shapefile_path = r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\ne_110m_admin_0_countries.shp'
csv_path = r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv'
world = gpd.read_file(shapefile_path)
data = pd.read_csv(csv_path)
# Step 2: Map and Align Country Names (If Necessary)
country_mapping = {
    "United States": "United States of America",
    "South Korea": "Republic of Korea",
    "Vietnam": "Viet Nam",
    # Add more mappings as necessary
}
data['Country'] = data['Country'].replace(country_mapping)
# Step 3: Merge the GeoDataFrame and DataFrame
merged = world.merge(data, how="left", left_on="SOVEREIGNT", right_on="Country")
# Step 4: Drop rows with NaN values in the 'Local Purchasing Power Index' column
merged = merged.dropna(subset=['Local Purchasing Power Index'])
# Step 5: Create the Heatmap
# Generate the heat data
heat_data = [
    [row['geometry'].centroid.y, row['geometry'].centroid.x, row['Local Purchasing Power Index']]
    for index, row in merged.iterrows()
]
# Initialize the map
m = folium.Map(location=[20, 0], zoom_start=2)
# Add the heatmap
HeatMap(heat_data, radius=15).add_to(m)
# Create a color scale legend using branca
colormap = cm.LinearColormap(
    colors=['blue', 'green', 'yellow', 'red'],  # Color gradient
    vmin=merged['Local Purchasing Power Index'].min(),
    vmax=merged['Local Purchasing Power Index'].max(),
    caption="Local Purchasing Power Index"
)
# Add the color scale to the map
colormap.add_to(m)
# Step 6: Save and Display the Map
m.save("purchasing_power_heatmap_with_legend.html")
m
The heat map above provides a visual representation of the Local Purchasing Power Index across the globe. The intensity of the color indicates the concentration of purchasing power within different regions. Areas with higher concentrations of purchasing power are shown in warmer colors (yellow to red), while regions with lower purchasing power are depicted in cooler colors (blue to green). Key observations include significant hotspots in regions such as Europe, parts of North America, and Australia, indicating areas with relatively higher purchasing power. Conversely, parts of Africa, South America, and Southeast Asia show cooler colors, reflecting lower purchasing power levels in these regions. This heat map offers an intuitive way to identify global patterns in economic strength, as measured by purchasing power, and highlights areas of economic disparity.
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the CSV and shapefile from the same local paths used earlier
shapefile_path = r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\ne_110m_admin_0_countries.shp'
csv_path = r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv'
world = gpd.read_file(shapefile_path)
data = pd.read_csv(csv_path)
# Region mapping (as previously defined)
region_mapping = {
"United States of America": "North America", "Canada": "North America", "Mexico": "North America",
"Brazil": "South America", "Argentina": "South America", "Chile": "South America",
"Germany": "Europe", "France": "Europe", "United Kingdom": "Europe", "Italy": "Europe",
"Spain": "Europe", "China": "Asia", "India": "Asia", "Japan": "Asia", "South Korea": "Asia",
"Russia": "Europe", "Australia": "Oceania", "New Zealand": "Oceania", "South Africa": "Africa",
"Nigeria": "Africa", "Egypt": "Africa", "Morocco": "Africa", "Saudi Arabia": "Middle East",
"United Arab Emirates": "Middle East", "Israel": "Middle East", "Turkey": "Middle East",
"Indonesia": "Asia", "Vietnam": "Asia", "Philippines": "Asia", "Thailand": "Asia",
"Malaysia": "Asia", "Singapore": "Asia", "Switzerland": "Europe", "Austria": "Europe",
"Netherlands": "Europe", "Belgium": "Europe", "Sweden": "Europe", "Norway": "Europe",
"Denmark": "Europe", "Finland": "Europe", "Ireland": "Europe", "Poland": "Europe",
"Czech Republic": "Europe", "Hungary": "Europe", "Portugal": "Europe", "Greece": "Europe",
"Turkey": "Middle East", "Ukraine": "Europe", "Belarus": "Europe", "Kazakhstan": "Asia",
"Uzbekistan": "Asia", "Kyrgyzstan": "Asia", "Armenia": "Asia", "Azerbaijan": "Asia",
"Georgia": "Europe", "Sri Lanka": "Asia", "Bangladesh": "Asia", "Pakistan": "Asia",
"Nepal": "Asia", "Bhutan": "Asia", "Maldives": "Asia", "Myanmar": "Asia", "Cambodia": "Asia",
"Laos": "Asia", "Brunei": "Asia", "Papua New Guinea": "Oceania", "Fiji": "Oceania",
"New Caledonia": "Oceania", "Solomon Islands": "Oceania", "Vanuatu": "Oceania", "Tonga": "Oceania",
"Samoa": "Oceania", "Iceland": "Europe", "Greenland": "North America", "Jamaica": "North America",
"Cuba": "North America", "Dominican Republic": "North America", "Haiti": "North America",
"Trinidad And Tobago": "North America", "Panama": "North America", "Costa Rica": "North America",
"El Salvador": "North America", "Honduras": "North America", "Guatemala": "North America",
"Nicaragua": "North America", "Venezuela": "South America", "Colombia": "South America",
"Peru": "South America", "Ecuador": "South America", "Bolivia": "South America",
"Paraguay": "South America", "Uruguay": "South America", "Brazil": "South America",
"Suriname": "South America", "Guyana": "South America", "French Guiana": "South America",
"Libya": "Africa", "Tunisia": "Africa", "Algeria": "Africa", "Morocco": "Africa", "Mali": "Africa",
"Niger": "Africa", "Chad": "Africa", "Sudan": "Africa", "South Sudan": "Africa", "Ethiopia": "Africa",
"Somalia": "Africa", "Kenya": "Africa", "Uganda": "Africa", "Tanzania": "Africa", "Rwanda": "Africa",
"Burundi": "Africa", "Democratic Republic of the Congo": "Africa", "Republic of the Congo": "Africa",
"Angola": "Africa", "Zambia": "Africa", "Malawi": "Africa", "Mozambique": "Africa", "Zimbabwe": "Africa",
"Botswana": "Africa", "Namibia": "Africa", "South Africa": "Africa", "Lesotho": "Africa",
"Eswatini": "Africa", "Madagascar": "Africa", "Mauritius": "Africa", "Comoros": "Africa",
"Seychelles": "Africa", "Djibouti": "Africa", "Eritrea": "Africa", "Saudi Arabia": "Middle East",
"Yemen": "Middle East", "Oman": "Middle East", "United Arab Emirates": "Middle East", "Qatar": "Middle East",
"Bahrain": "Middle East", "Kuwait": "Middle East", "Iran": "Middle East", "Iraq": "Middle East",
"Syria": "Middle East", "Lebanon": "Middle East", "Jordan": "Middle East", "Palestine": "Middle East",
"Israel": "Middle East", "Afghanistan": "Asia", "Tajikistan": "Asia", "Turkmenistan": "Asia",
"Uzbekistan": "Asia", "Kyrgyzstan": "Asia", "Kazakhstan": "Asia", "Mongolia": "Asia"
}
# Add Region column based on mapping
data['Region'] = data['Country'].map(region_mapping)
# Group by region and calculate mean of each index
region_grouped = data.groupby('Region').agg({
    'Local Purchasing Power Index': 'mean',
    'Cost of Living Index': 'mean',
    'Rent Index': 'mean',
    'Groceries Index': 'mean',
    'Restaurant Price Index': 'mean'
}).reset_index()
# Assuming 'region_grouped' is your DataFrame with the regional averages
plt.figure(figsize=(12, 8))
# Use lighter colors for the bars
colors = ['lightcoral', 'lightseagreen', 'lightskyblue', 'lightpink', 'lightgreen', 'lightblue', 'lightgoldenrodyellow']
# Plot 1: Bar Plot of Average Local Purchasing Power Index by Region
plt.bar(region_grouped['Region'], region_grouped['Local Purchasing Power Index'], color=colors)
plt.title('Average Local Purchasing Power Index by Region')
plt.xlabel('Region')
plt.ylabel('Local Purchasing Power Index')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show() # Render this plot
# Plot 2: Facet Grid of Cost of Living Index vs. Local Purchasing Power Index by Region
# FacetGrid creates its own figure, so no separate plt.figure() call is needed
g = sns.FacetGrid(data, col="Region", col_wrap=4, height=4, aspect=1.2)
g.map(sns.scatterplot, "Cost of Living Index", "Local Purchasing Power Index", alpha=0.7)
g.fig.suptitle("Cost of Living Index vs. Local Purchasing Power Index by Region", fontsize=16)
g.fig.subplots_adjust(top=0.9)
plt.show() # Render this plot
# Plot 3: Pair Plot across the Regions
# pairplot also creates its own figure; set the title on the grid it returns
pair_grid = sns.pairplot(data, hue="Region", diag_kind="kde", height=2.5)
pair_grid.fig.suptitle("Pair Plot of Economic Indices by Region", y=1.02, fontsize=16)
plt.show() # Render this plot
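Because the `Region` column is built with `Series.map`, any country whose name is not a key in `region_mapping` receives `NaN` and is then left out of the `groupby` averages without warning. The sketch below shows one way to list such countries; the miniature mapping and country list are hypothetical and chosen only to demonstrate the check.

```python
import pandas as pd

# Hypothetical subset of the mapping and of the Country column.
region_mapping = {
    "United States of America": "North America",
    "Canada": "North America",
    "Japan": "Asia",
}
data = pd.DataFrame({"Country": ["United States", "Canada", "Japan", "Estonia"]})

data["Region"] = data["Country"].map(region_mapping)

# Countries absent from the mapping receive NaN in the Region column.
unmapped = data.loc[data["Region"].isna(), "Country"].tolist()
print("Unmapped countries:", unmapped)

# groupby skips NaN keys by default, so these counts cover mapped rows only.
region_counts = data.groupby("Region").size()
print(region_counts)
```

Running a check like this before aggregating makes it clear how many countries actually contribute to each regional average.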
Geospatial Analysis Summary¶
The geospatial analysis conducted in this section reveals significant insights into the distribution of economic indices across different regions worldwide. Through the use of heatmaps and regional bar charts, we can observe distinct patterns in the Local Purchasing Power Index and Cost of Living Index. The heatmaps illustrate regions with higher economic power, particularly concentrated in Europe, North America, and parts of Oceania. This distribution highlights the disparities in purchasing power and cost of living across the globe. The regional bar charts further emphasize these differences, showing that regions like Oceania and Europe generally exhibit higher average purchasing power, while regions such as Africa and South America fall on the lower end. These findings are consistent with previous research that indicates a strong correlation between economic development levels and purchasing power across regions (e.g., Krugman, 1991). Such geospatial visualizations provide a comprehensive view of how economic factors are distributed globally, enabling policymakers and economists to better understand regional economic disparities and inform decisions to address these inequalities.
In the Machine Learning section, we explore predictive modeling by first preparing the dataset, ensuring it is clean and ready for analysis. This includes handling missing values, encoding categorical variables, and scaling numerical features. We then proceed with model selection, where we evaluate various machine learning algorithms such as linear regression, decision trees, and random forests to identify the best-performing model for our economic indices dataset. Model evaluation and tuning are conducted through cross-validation and hyperparameter optimization to enhance predictive accuracy. Finally, we interpret the model's results to extract valuable insights and outline the next steps for potential improvements and future analysis.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Drop any rows with missing values
# (this also drops countries whose Region was left unmapped in the previous step)
data_ml = data.dropna()
# Convert categorical variables to dummy variables (one-hot encoding);
# this includes 'Country', which adds one indicator column per country
data_ml = pd.get_dummies(data_ml, drop_first=True)
# Separate features (X) and target (y)
X = data_ml.drop(columns=['Local Purchasing Power Index'])
y = data_ml['Local Purchasing Power Index']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In the Data Preparation for Machine Learning step, we prepare the dataset for model training. Rows with missing values are dropped, and categorical variables, such as country names, are converted to numerical format with one-hot encoding. Because the dataset appears to contain one row per country, each country indicator is nonzero in a single row and so acts as a row identifier more than a generalizable feature, which matters when the coefficients are interpreted later. Finally, the features are standardized using statistics from the training split only, so that they are on the same scale; this is important for algorithms sensitive to feature scaling, such as Ridge and Lasso.
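The effect of one-hot encoding an identifier column can be checked directly. The sketch below builds a small hypothetical table with one row per country, applies the same `get_dummies` and `train_test_split` calls as above, and counts how many country columns never take a nonzero value in the training split. The table contents and the `X_numeric` alternative are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical table: one row per country, one numeric feature, one target.
frame = pd.DataFrame({
    "Country": [f"Country_{i}" for i in range(20)],
    "Cost of Living Index": rng.uniform(20, 100, size=20),
    "Local Purchasing Power Index": rng.uniform(10, 150, size=20),
})

encoded = pd.get_dummies(frame, drop_first=True)
X = encoded.drop(columns=["Local Purchasing Power Index"])
y = encoded["Local Purchasing Power Index"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Indicator columns for countries that landed in the test split are all zero in training.
dummy_cols = [c for c in X.columns if c.startswith("Country_")]
never_seen = [c for c in dummy_cols if X_train[c].sum() == 0]
print(f"{len(never_seen)} of {len(dummy_cols)} country columns are constant in the training split")

# One alternative: keep the identifier out of the feature matrix altogether.
X_numeric = frame.drop(columns=["Country", "Local Purchasing Power Index"])
print(list(X_numeric.columns))
```

A model cannot learn anything from a column that is constant in training, so features of this kind add dimensions without adding information that carries over to unseen countries.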
9.2 Model Selection¶
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}
# Dictionary to store the results
results = {}
# Train and evaluate each model
for name, model in models.items():
    # Train the model
    model.fit(X_train_scaled, y_train)
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    # Store the results
    results[name] = {'MSE': mse, 'R-squared': r2}
# Display the results
for model_name, metrics in results.items():
    print(f"{model_name} - MSE: {metrics['MSE']:.4f}, R-squared: {metrics['R-squared']:.4f}")
Linear Regression - MSE: 575.9529, R-squared: 0.6028
Ridge Regression - MSE: 356.1119, R-squared: 0.7544
Lasso Regression - MSE: 516.5958, R-squared: 0.6437
Random Forest - MSE: 393.4383, R-squared: 0.7287
Gradient Boosting - MSE: 517.3378, R-squared: 0.6432
In the Model Selection step, we trained and evaluated several regression models: Linear Regression, Ridge Regression, Lasso Regression, Random Forest, and Gradient Boosting. Each model was assessed on the held-out test set using Mean Squared Error (MSE) and R-squared. Ridge Regression performed best, with an MSE of 356.1119 and an R-squared of 0.7544. Random Forest came second, with an MSE of 393.4383 and an R-squared of 0.7287, while Linear Regression, Lasso Regression, and Gradient Boosting were noticeably weaker, with R-squared values between 0.60 and 0.64. On the basis of this single train/test split, Ridge Regression is selected as the model to tune further.
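A ranking based on one train/test split can shift when the split changes, so it can help to compare models with k-fold cross-validation as well. The following is a minimal sketch using `cross_val_score` on synthetic data; the feature matrix, target, and the choice of three candidate models are assumptions for illustration and do not reproduce the results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 121 rows, 5 correlated features, noisy linear target.
base = rng.normal(size=(121, 1))
X = base + 0.3 * rng.normal(size=(121, 5))
y = 40.0 + 15.0 * X[:, 0] + rng.normal(scale=10.0, size=121)

# Scaling sits inside the pipeline so it is refit on each training fold.
candidates = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge Regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how much a single-split comparison might move.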
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# Define the parameter grid for Ridge Regression
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0],  # Regularization strength
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag']  # Solvers for Ridge regression
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=Ridge(), param_grid=param_grid, cv=5, scoring='r2', n_jobs=-1)
# Fit the model with GridSearchCV
grid_search.fit(X_train_scaled, y_train)
# Best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print(f"Best Parameters: {best_params}")
print(f"Best Cross-Validation Score: {best_score:.4f}")
print(f"Test Set MSE: {mse_best:.4f}")
print(f"Test Set R-squared: {r2_best:.4f}")
Best Parameters: {'alpha': 0.1, 'solver': 'auto'}
Best Cross-Validation Score: 0.5028
Test Set MSE: 356.6717
Test Set R-squared: 0.7540
In the Model Evaluation and Tuning step, we used GridSearchCV to search over the Ridge Regression hyperparameters. The best combination was an alpha of 0.1 with the auto solver, which gave a mean cross-validated R-squared of 0.5028 on the training data. On the test set, the tuned model achieved an MSE of 356.6717 and an R-squared of 0.7540, essentially the same as the untuned Ridge model (MSE 356.1119, R-squared 0.7544), so tuning brought no measurable gain. The gap between the cross-validated score (0.50) and the test score (0.75) suggests that performance depends noticeably on which countries fall into each split, so the test-set figure is best treated as one estimate among several and not as proof of robustness. Ridge Regression remains the strongest of the candidates evaluated so far.
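One common way to organize this kind of tuning is to place the scaler and the model in a single scikit-learn `Pipeline`, so that scaling is refit inside every cross-validation fold, and then to compare the cross-validated score with the test score side by side. The sketch below does this on synthetic data; the data, the step names `scale` and `ridge`, and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the feature matrix and target.
X = rng.normal(size=(121, 5))
y = 50.0 + X @ np.array([12.0, -4.0, 0.0, 3.0, 0.0]) + rng.normal(scale=8.0, size=121)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

cv_r2 = search.best_score_                # mean R^2 across the validation folds
test_r2 = search.score(X_test, y_test)    # R^2 of the refit best pipeline on the test set
print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f"Mean CV R^2: {cv_r2:.3f}  |  Test R^2: {test_r2:.3f}")
```

When the two numbers differ widely, the difference itself is informative: it points to high variance across splits, which is common with datasets of around a hundred rows.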
import pandas as pd
import matplotlib.pyplot as plt
# Get the coefficients of the Ridge Regression model
coefficients = best_model.coef_
# Create a DataFrame to pair feature names with their coefficients
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': coefficients
})
# Sort the DataFrame by the absolute value of the coefficients
coef_df['Abs_Coefficient'] = coef_df['Coefficient'].abs()
coef_df = coef_df.sort_values(by='Abs_Coefficient', ascending=False)
# Display the features, largest coefficients first
print(coef_df[['Feature', 'Coefficient']])
                 Feature  Coefficient
51        Country_Kuwait     9.894481
65          Country_Oman     8.096656
76  Country_Saudi Arabia     7.370909
78  Country_South Africa     6.390041
39         Country_India     6.088766
..                   ...          ...
73      Country_Portugal     0.000000
74         Country_Qatar     0.000000
23    Country_Costa Rica     0.000000
77     Country_Singapore     0.000000
72        Country_Poland     0.000000

[105 rows x 2 columns]
# Assuming 'coef_df' is your DataFrame with feature names and coefficients
# Sort the DataFrame by the absolute value of coefficients
coef_df['abs_coef'] = coef_df['Coefficient'].abs()
coef_df = coef_df.sort_values(by='abs_coef', ascending=False)
# Select top 10 and bottom 10 features
top_features = pd.concat([coef_df.head(10), coef_df.tail(10)])
# Create a compact figure with a grid style
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Plot the coefficients with annotations
barplot = sns.barplot(x="Coefficient", y="Feature", data=top_features, palette="coolwarm")
# Add annotations to each bar
for index, value in enumerate(top_features['Coefficient']):
    plt.text(value, index, f'{value:.2f}', color='black', va="center", ha="left" if value < 0 else "right")
# Add titles and labels
plt.title('Top 10 and Bottom 10 Feature Importances', fontsize=16)
plt.xlabel('Coefficient Value', fontsize=14)
plt.ylabel('Features', fontsize=14)
# Adjust the plot to ensure everything fits without overlapping
plt.tight_layout()
plt.show()
In the Ridge Regression model, the largest coefficients by magnitude belong almost entirely to country indicator variables. As depicted in the chart, Kuwait, Oman, and Saudi Arabia have the highest positive coefficients, while Lebanon, Cuba, and Nigeria have the most negative ones. Since the dataset appears to contain one row per country, each of these coefficients is a country-specific offset fitted to a single training observation: it records how far that country's purchasing power sits above or below what the other features predict, and it does not generalize to other countries. Several countries, such as Qatar, Iceland, and Portugal, have coefficients at or near zero. For those that are exactly zero in the printed output (for example Qatar and Portugal), the most likely explanation is the train/test split, not a negligible economic effect: a country assigned to the test set has an indicator column that is zero in every training row, so the model cannot learn a coefficient for it. For these reasons the chart is better read as a diagnostic of the encoding than as a ranking of economic drivers.
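The behavior of indicator columns under Ridge can be reproduced in a few lines. In the sketch below, one indicator flags a single training row and another never fires in training, mimicking a country that ended up in the test split. All data and column names are synthetic and hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_train = 30
numeric = rng.normal(size=(n_train, 1))

# One indicator flags a single training row; the other is zero in every training row
# (as happens when the only row for that category was assigned to the test split).
seen_dummy = np.zeros((n_train, 1))
seen_dummy[0, 0] = 1.0
unseen_dummy = np.zeros((n_train, 1))

X_train = np.hstack([numeric, seen_dummy, unseen_dummy])
y_train = 5.0 * numeric[:, 0] + rng.normal(scale=1.0, size=n_train)
y_train[0] += 20.0  # give the flagged row an unusually high target value

# StandardScaler leaves a constant column at zero instead of dividing by zero.
X_scaled = StandardScaler().fit_transform(X_train)
model = Ridge(alpha=0.1).fit(X_scaled, y_train)

names = ["numeric", "seen_dummy", "unseen_dummy"]
print(dict(zip(names, np.round(model.coef_, 4))))
```

The indicator that fires once absorbs that row's unusual target value, while the indicator that never fires gets a coefficient of zero regardless of what its category looks like in the test data.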
# Import necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge
# X_train, X_test, y_train, y_test come from the earlier train/test split.
# Note: these are the unscaled features, unlike the scaled arrays used in the previous steps.
# Define the models and hyperparameters to tune
models_and_parameters = {
    'Ridge': {
        'model': Ridge(),
        'params': {
            'alpha': [0.1, 1, 10, 100],
            'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag']
        }
    },
    'RandomForest': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [50, 100, 200],
            # None uses all features; the 'auto' option was removed in recent scikit-learn releases
            'max_features': [None, 'sqrt', 'log2'],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingRegressor(),
        'params': {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'subsample': [0.8, 0.9, 1.0],
            'max_features': [None, 'sqrt', 'log2']
        }
    },
    'XGBoost': {
        'model': XGBRegressor(),
        'params': {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'subsample': [0.8, 0.9, 1.0],
            'colsample_bytree': [0.8, 0.9, 1.0]
        }
    }
}
# Function to perform grid search and print the best results
def refine_model(model_name, model, params, X_train, y_train, X_test, y_test):
    grid_search = GridSearchCV(model, params, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    # Best parameters and best score from cross-validation
    # (the score is negated MSE, so the sign is flipped to report MSE)
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    print(f"Best cross-validation score for {model_name}: {-grid_search.best_score_}")
    # Test the best model on the test set
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} Test Set MSE: {mse}")
    print(f"{model_name} Test Set R-squared: {r2}\n")
    return best_model
# Loop over the models and parameters
best_models = {}
for model_name, mp in models_and_parameters.items():
    best_model = refine_model(model_name, mp['model'], mp['params'], X_train, y_train, X_test, y_test)
    best_models[model_name] = best_model
# After tuning, compare the best models and choose the most appropriate one
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters for Ridge: {'alpha': 1, 'solver': 'svd'}
Best cross-validation score for Ridge: 975.7798253988352
Ridge Test Set MSE: 564.3317926356939
Ridge Test Set R-squared: 0.6108066943127572
Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best parameters for RandomForest: {'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Best cross-validation score for RandomForest: 881.2965750042989
RandomForest Test Set MSE: 379.33780870519945
RandomForest Test Set R-squared: 0.7383884132194583
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best parameters for GradientBoosting: {'learning_rate': 0.2, 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 50, 'subsample': 0.9}
Best cross-validation score for GradientBoosting: 788.6545787014338
GradientBoosting Test Set MSE: 505.2196060733247
GradientBoosting Test Set R-squared: 0.6515736112131154
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best parameters for XGBoost: {'colsample_bytree': 0.8, 'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Best cross-validation score for XGBoost: 975.8462322065692
XGBoost Test Set MSE: 589.133550443729
XGBoost Test Set R-squared: 0.593702079201351
Summary of Machine Learning Models¶
Despite applying a range of machine learning techniques and refining them through hyperparameter tuning, the more advanced models did not clearly outperform the simpler ones. The initial models, Linear Regression (MSE: 575.95, R-squared: 0.6028) and Ridge Regression (MSE: 356.11, R-squared: 0.7544), showed promising results. After grid search, Ridge Regression reached a test MSE of 564.33 (R-squared: 0.6108), Random Forest 379.34 (R-squared: 0.7384), Gradient Boosting 505.22 (R-squared: 0.6516), and XGBoost 589.13 (R-squared: 0.5937). None of the tuned models improved on the initial Ridge result, and the tuned Ridge model scored worse on the test set than the initial one. The cross-validated MSEs (roughly 789 to 976) are also well above the test-set MSEs, which suggests that the error estimates vary considerably from one data split to another.
These results suggest that although the Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, and XGBoost models were carefully tuned, the gain in predictive accuracy was minimal. This outcome highlights the difficulty of modeling economic indices, which may require more sophisticated approaches or different types of models. Future work could explore neural networks and deep learning, stacking and blending ensembles, AutoML tools, and Bayesian hyperparameter optimization to see whether they predict economic indices more accurately.
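Because MSE is expressed in squared index units, converting it to RMSE can make the comparison easier to read. The sketch below does this for the tuned models, using the cross-validation and test-set MSE values printed in the grid-search output; the dictionary layout and variable names are illustrative.

```python
import math

# (cross-validated MSE, test-set MSE) as printed in the grid-search output
reported_mse = {
    "Ridge": (975.7798253988352, 564.3317926356939),
    "RandomForest": (881.2965750042989, 379.33780870519945),
    "GradientBoosting": (788.6545787014338, 505.2196060733247),
    "XGBoost": (975.8462322065692, 589.133550443729),
}

# RMSE is the square root of MSE and is in the same units as the target index
rmse = {name: (math.sqrt(cv), math.sqrt(test)) for name, (cv, test) in reported_mse.items()}

for name, (cv_rmse, test_rmse) in rmse.items():
    print(f"{name:<17} CV RMSE: {cv_rmse:6.2f}   Test RMSE: {test_rmse:6.2f}")

# The best model depends on which estimate is used
best_by_cv = min(rmse, key=lambda name: rmse[name][0])
best_by_test = min(rmse, key=lambda name: rmse[name][1])
print(f"Lowest CV RMSE: {best_by_cv}; lowest test RMSE: {best_by_test}")
```

The model with the lowest cross-validated error (Gradient Boosting) is not the one with the lowest test error (Random Forest), which is one reason to treat the ranking of models as tentative.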
Statistical hypothesis testing will be employed to evaluate relationships and differences within the data. This method allows us to formally assess whether the observed patterns, such as differences in the Cost of Living Index between regions or the correlation between economic indices, are statistically significant or merely due to random chance. By formulating null and alternative hypotheses and applying appropriate statistical tests, like ANOVA for regional comparisons and correlation tests for relationships between indices, we can draw data-driven conclusions and validate our insights. This rigorous approach ensures the reliability of our findings, thereby enhancing the credibility of the overall analysis.
Question 1: Is there a significant difference in the average Cost of Living Index between different regions?¶
Null Hypothesis (H₀): There is no significant difference in the average Cost of Living Index across different regions.
Alternative Hypothesis (H₁): There is a significant difference in the average Cost of Living Index across different regions.
Question 2: Is there a significant correlation between the Cost of Living Index and the Local Purchasing Power Index?¶
Null Hypothesis (H₀): There is no significant correlation between the Cost of Living Index and the Local Purchasing Power Index.
Alternative Hypothesis (H₁): There is a significant correlation between the Cost of Living Index and the Local Purchasing Power Index.
Question 3: Is there a significant difference in economic indices between high-income and low-income countries?¶
Null Hypothesis (H₀): There is no significant difference in economic indices (such as Cost of Living Index and Local Purchasing Power Index) between high-income and low-income countries.
Alternative Hypothesis (H₁): There is a significant difference in economic indices between high-income and low-income countries.
Formulating Hypotheses involves defining clear, testable statements to explore relationships and differences within the dataset. In this step, we have identified key questions related to regional differences in the Cost of Living Index, the correlation between economic indices, and differences between high-income and low-income countries. For each question, we have established null and alternative hypotheses, which will guide our subsequent analysis and testing.
For the hypotheses formulated in 10.1, the selection of appropriate statistical tests is crucial to validate the hypotheses. Here’s a breakdown of the tests that align with each hypothesis:
Hypothesis 1: Is there a significant difference in the average Cost of Living Index between different regions?¶
Test to Use: ANOVA (Analysis of Variance)
Rationale: ANOVA compares the means of three or more groups to see whether at least one group mean differs significantly from the others. Since we are comparing the Cost of Living Index across regions (which are categorical groups), ANOVA is the appropriate choice.
Hypothesis 2: Is there a significant correlation between the Cost of Living Index and the Local Purchasing Power Index?¶
Test to Use: Pearson Correlation Coefficient
Rationale: The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. Here it assesses whether there is a statistically significant correlation between the Cost of Living Index and the Local Purchasing Power Index.
Hypothesis 3: Is there a significant difference in economic indices between high-income and low-income countries?¶
Test to Use: Independent Samples T-Test (or Mann-Whitney U Test if non-parametric)
Rationale: The Independent Samples T-Test compares the means of two independent groups (here, high-income vs. low-income countries) to see whether they differ significantly. If the data do not meet the assumptions of a T-Test (e.g., normality), the Mann-Whitney U Test, a non-parametric alternative, can be used.
10.2.1. Check Assumptions¶
a. Normality (for ANOVA and T-Test): Shapiro-Wilk Test and Q-Q Plot¶
## Shapiro-Wilk Test: This tests if the data is normally distributed.
from scipy.stats import shapiro
# Check normality of the Cost of Living Index across all countries in the dataset
stat, p_value = shapiro(data['Cost of Living Index'])
print('Shapiro-Wilk Test: Statistics=%.3f, p=%.3f' % (stat, p_value))
# Interpretation: if p > 0.05, we fail to reject normality; if p <= 0.05, the data deviate significantly from a normal distribution.
Shapiro-Wilk Test: Statistics=0.935, p=0.000
## Q-Q Plot: Visualize if the data follows a normal distribution.
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Q-Q plot for Cost of Living Index
sm.qqplot(data['Cost of Living Index'], line='s')  # 's' draws a standardized reference line; '45' only suits data already on a standard normal scale
plt.show()
b. Homogeneity of Variances (for ANOVA): Levene's Test¶
## Levene's Test: This tests if the variances are equal across groups.
from scipy.stats import levene
# Map countries to regions in the DataFrame
data['Region'] = data['Country'].map(region_mapping)
# Ensure that the data has no NaN values after mapping
data.dropna(subset=['Region'], inplace=True)
# Levene's Test: This tests if the variances are equal across groups.
stat, p_value = levene(data['Cost of Living Index'][data['Region'] == 'Europe'],
data['Cost of Living Index'][data['Region'] == 'Asia'],
data['Cost of Living Index'][data['Region'] == 'Africa'])
print('Levene’s Test: Statistics=%.3f, p=%.3f' % (stat, p_value))
# Interpretation: if p > 0.05, there is no evidence that the variances differ; if p <= 0.05, the variances are unequal.
Levene’s Test: Statistics=3.867, p=0.027
c. Linearity (for Pearson Correlation): Scatter Plot:¶
import seaborn as sns
# Scatter plot to check linearity between Cost of Living Index and Local Purchasing Power Index
sns.scatterplot(x='Cost of Living Index', y='Local Purchasing Power Index', data=data)
plt.show()
Assumptions and Justifications¶
Before conducting hypothesis tests, it is essential to check that the data satisfy the assumptions of the tests being considered. For ANOVA and correlation analyses, these are normality, homogeneity of variances, and linearity. The checks show that the Cost of Living Index deviates significantly from normality (Shapiro-Wilk: W = 0.935, p < 0.001) and that variances differ across the three regions tested, Europe, Asia, and Africa (Levene's test: statistic = 3.867, p = 0.027). Both violations could undermine parametric tests such as ANOVA. The scatter plot indicates a positive linear relationship between the Cost of Living Index and the Local Purchasing Power Index, which supports correlation analysis. Given these findings, non-parametric tests are more appropriate: the Kruskal-Wallis test for comparing groups and Spearman's rank correlation for assessing relationships. Checking assumptions in this way helps ensure that the tests applied are appropriate and that the results are reliable.
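ANOVA's normality assumption applies within each group, so it can be useful to run the Shapiro-Wilk test per region rather than only on the pooled data. The sketch below does this on synthetic data; the helper name `check_anova_assumptions`, the sample values, and the 0.05 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene, shapiro


def check_anova_assumptions(df, value_col, group_col, alpha=0.05):
    """Run Shapiro-Wilk per group and Levene's test across groups (hypothetical helper)."""
    groups = {name: g[value_col].to_numpy() for name, g in df.groupby(group_col)}
    # Shapiro-Wilk p-value for each group separately
    normality_p = {name: shapiro(values).pvalue for name, values in groups.items()}
    # Levene's test compares variances across all groups at once
    levene_p = levene(*groups.values()).pvalue
    parametric_ok = all(p > alpha for p in normality_p.values()) and levene_p > alpha
    return {
        "normality_p": normality_p,
        "levene_p": levene_p,
        "suggested_test": "ANOVA" if parametric_ok else "Kruskal-Wallis",
    }


# Synthetic example: three regions, one of them skewed on purpose
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Region": ["Europe"] * 30 + ["Asia"] * 30 + ["Africa"] * 30,
    "Cost of Living Index": np.concatenate([
        rng.normal(55, 10, 30),
        rng.normal(40, 10, 30),
        rng.exponential(30, 30),
    ]),
})

result = check_anova_assumptions(demo, "Cost of Living Index", "Region")
print(result)
```

The helper only suggests a test; the final choice should still take sample sizes and the plots into account.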
In this step, we will conduct hypothesis tests based on the hypotheses from section 10.1 and the assumptions discussed in section 10.2. Due to violations of assumptions required for parametric tests like ANOVA, we will use non-parametric tests where necessary. Specifically, the Kruskal-Wallis test will assess differences in the Cost of Living Index across regions, as it does not assume normality or equal variances. Spearman’s rank correlation will be used to evaluate the relationship between the Cost of Living Index and the Local Purchasing Power Index, given the non-normality of the data. Lastly, the Mann-Whitney U test will compare economic indices between high-income and low-income countries, as it is suitable for comparing two independent groups without assuming normality.
import pandas as pd
import scipy.stats as stats
# Load the dataset
data = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv')
# Define a threshold for high-income and low-income countries
# For example, let's assume the threshold for high income is a Local Purchasing Power Index above 60
income_threshold = 60
# Create the Income_Group column
data['Income_Group'] = data['Local Purchasing Power Index'].apply(lambda x: 1 if x > income_threshold else 0)
# Check the first few rows to confirm the new column
print(data[['Country', 'Local Purchasing Power Index', 'Income_Group']].head())
# Save the updated dataframe if needed
# data.to_csv('Cost_of_Living_Index_with_Income_Group.csv', index=False)
# Mann-Whitney U Test: Economic indices between high-income and low-income countries
high_income = data[data['Income_Group'] == 1]['Cost of Living Index']
low_income = data[data['Income_Group'] == 0]['Cost of Living Index']
# Perform the Mann-Whitney U Test
mannwhitney_result = stats.mannwhitneyu(high_income, low_income)
# Display the results
print(f"Mann-Whitney U Test: Statistics={mannwhitney_result.statistic}, p-value={mannwhitney_result.pvalue}")
# Spearman's Rank Correlation between Cost of Living Index and Local Purchasing Power Index
spearman_result = stats.spearmanr(data['Cost of Living Index'], data['Local Purchasing Power Index'])
# Display the results
print(f"Spearman's Rank Correlation: Correlation={spearman_result.correlation}, p-value={spearman_result.pvalue}")
Country Local Purchasing Power Index Income_Group
0 Albania 39.9 0
1 Algeria 29.9 0
2 Argentina 41.5 0
3 Armenia 38.5 0
4 Australia 127.4 1
Mann-Whitney U Test: Statistics=3100.5, p-value=8.021521482986537e-12
Spearman's Rank Correlation: Correlation=0.7053094199200701, p-value=1.701774651887184e-19
Hypothesis Test Results¶
The hypothesis tests give clear results. The Mann-Whitney U Test found a statistically significant difference in the Cost of Living Index between high-income and low-income countries (U = 3100.5, p ≈ 8.02e-12), so we reject the null hypothesis for this index. Here, the income groups are defined by the assumed threshold of a Local Purchasing Power Index above 60. Spearman's Rank Correlation showed a strong, statistically significant positive correlation between the Cost of Living Index and the Local Purchasing Power Index (ρ = 0.7053, p ≈ 1.70e-19), so we also reject the null hypothesis of no correlation: higher costs of living are associated with greater purchasing power. Because the income groups are themselves derived from the Local Purchasing Power Index, the two results are closely related rather than independent pieces of evidence.
Interpretation of Hypothesis 1: Regional Differences in Cost of Living Index¶
The Kruskal-Wallis test is the planned test for differences in the Cost of Living Index across regions. Its statistic and p-value are not reported in the output above, so this hypothesis remains to be evaluated. If the p-value is below 0.05, we would reject the null hypothesis and conclude that the Cost of Living Index differs between regions. Such a result would suggest that region is associated with the cost of living, possibly through factors like economic policies, infrastructure, and market dynamics, and that policymakers and economists should consider regional characteristics when analyzing cost of living data.
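As a minimal sketch of how the regional comparison could be run, the code below applies `scipy.stats.kruskal` to one sample per region. It uses synthetic values in place of the project's data, and the column names `Region` and `Cost of Living Index` are assumed to match the DataFrame used in the assumption checks.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Synthetic stand-in for the real data: three regions with different typical index levels
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Region": ["Europe"] * 25 + ["Asia"] * 25 + ["Africa"] * 25,
    "Cost of Living Index": np.concatenate([
        rng.normal(60, 8, 25),
        rng.normal(40, 8, 25),
        rng.normal(30, 8, 25),
    ]),
})

# Build one array of index values per region
samples = [group["Cost of Living Index"].to_numpy() for _, group in demo.groupby("Region")]

# Kruskal-Wallis H-test: a rank-based alternative to one-way ANOVA
h_stat, p_value = kruskal(*samples)
print(f"Kruskal-Wallis Test: H={h_stat:.3f}, p-value={p_value:.3g}")

if p_value < 0.05:
    print("Reject H0: at least one region differs in its distribution of the index.")
else:
    print("Fail to reject H0: no evidence of a regional difference.")
```

With the real data, `demo` would be replaced by the DataFrame that carries the mapped `Region` column.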
Interpretation of Hypothesis 2: Correlation Between Cost of Living Index and Local Purchasing Power Index¶
The Spearman's Rank Correlation test yielded a correlation coefficient of 0.705 with a p-value of approximately 1.70e-19, indicating a strong, statistically significant positive correlation between the Cost of Living Index and the Local Purchasing Power Index. This suggests that as the cost of living increases, local purchasing power generally increases as well: wealthier countries tend to have a higher cost of living but also greater purchasing power. This relationship underscores how closely the two indices are linked and suggests that they should be analyzed together when assessing economic health.
Interpretation of Hypothesis 3: Differences Between High-Income and Low-Income Countries¶
The Mann-Whitney U test compared the Cost of Living Index between high-income and low-income countries, giving a test statistic of 3100.5 and a p-value of approximately 8.02e-12. Since the p-value is well below the 0.05 threshold, we reject the null hypothesis: the Cost of Living Index differs significantly between the two groups. The Local Purchasing Power Index was used only to define the groups and was not itself tested. This difference highlights the disparity between countries of different income levels and suggests that income classification is closely tied to economic outcomes.
Summary of Findings¶
These results show a strong correlation between the Cost of Living Index and the Local Purchasing Power Index and a significant difference in the Cost of Living Index between high-income and low-income countries. Regional differences in the Cost of Living Index still need to be confirmed with the Kruskal-Wallis test. While these findings are statistically significant, it is important to consider their practical significance and the broader economic context. These insights will inform the conclusions drawn in subsequent sections of the analysis.
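One way to gauge practical significance for the Mann-Whitney U test is the rank-biserial correlation, r = 2U / (n1 * n2) - 1, which ranges from -1 to 1. The sketch below computes it on small illustrative samples; the helper name `rank_biserial` and the sample values are assumptions, not results from the dataset.

```python
from scipy.stats import mannwhitneyu


def rank_biserial(x, y):
    """Rank-biserial correlation derived from the Mann-Whitney U statistic of x versus y."""
    result = mannwhitneyu(x, y, alternative="two-sided")
    # SciPy (1.7+) reports U for the first sample; rescale it to the range [-1, 1]
    effect = 2 * result.statistic / (len(x) * len(y)) - 1
    return effect, result.pvalue


# Illustrative samples, not taken from the dataset
high_income_demo = [70.2, 65.1, 80.4, 58.3, 62.0]
low_income_demo = [28.9, 41.0, 29.4, 35.2, 44.7]

effect, p = rank_biserial(high_income_demo, low_income_demo)
print(f"Rank-biserial r = {effect:.2f}, p-value = {p:.4f}")
```

A value near 1 means almost every observation in the first group exceeds those in the second, while a value near 0 means the groups overlap heavily even if the p-value is small.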
Economic efficiency analysis evaluates how well countries use their resources to achieve economic prosperity, often considering factors such as cost of living, purchasing power, and overall productivity. This section will explore how different countries manage the balance between the cost of living and local purchasing power, aiming to identify which nations are the most efficient in providing high living standards relative to their economic output. By comparing these efficiency measures across countries and regions, we can gain insights into the effectiveness of various economic policies and identify best practices for achieving sustainable economic growth (Krugman & Wells, 2018).
Economic efficiency is generally measured by the ability of a country to convert its economic resources into high living standards for its citizens. In this step, we'll define the key metrics that will be used to assess economic efficiency. These might include the ratio of Local Purchasing Power to Cost of Living, GDP per capita relative to Cost of Living, and other relevant indicators. The data for GDP and income were sourced from the World Bank Group database. By establishing clear metrics, we can systematically compare different countries' economic efficiency (Mankiw, 2020).
Data Preparation:¶
# Import necessary libraries
import pandas as pd
# Load the datasets
cost_of_living_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv')
gdp_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Economic Analysis data\GDP per capita (current US$\GDP 2023.csv')
gni_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Economic Analysis data\GNI per capita, PPP (current international $)\GNI 2023.csv')
# Display the first few rows of each dataframe to inspect the data
print(cost_of_living_df.head())
print(gdp_df.head())
print(gni_df.head())
# Standardize country names if necessary
# (Assuming the 'Country' column exists in all datasets)
# For simplicity, let's assume the column name for countries is 'Country' in all datasets
cost_of_living_df['Country'] = cost_of_living_df['Country'].str.strip()
gdp_df['Country'] = gdp_df['Country'].str.strip()
gni_df['Country'] = gni_df['Country'].str.strip()
# Check for missing values in key columns
print(cost_of_living_df.isnull().sum())
print(gdp_df.isnull().sum())
print(gni_df.isnull().sum())
# Merge the datasets on 'Country'
merged_df = cost_of_living_df.merge(gdp_df, on='Country', how='inner')
merged_df = merged_df.merge(gni_df, on='Country', how='inner')
# Display the first few rows of the merged dataframe
print(merged_df.head())
# Check for any remaining missing values after merging
print(merged_df.isnull().sum())
Rank Country Cost of Living Index Rent Index \
0 53 Albania 42.1 10.6
1 101 Algeria 28.9 3.8
2 98 Argentina 29.4 7.6
3 58 Armenia 41.0 19.0
4 10 Australia 70.2 33.4
Cost of Living Plus Rent Index Groceries Index Restaurant Price Index \
0 27.0 42.0 35.7
1 16.9 36.8 14.0
2 18.9 29.7 24.8
3 30.5 36.0 38.9
4 52.5 77.3 62.5
Local Purchasing Power Index
0 39.9
1 29.9
2 41.5
3 38.5
4 127.4
Country 2023
0 Aruba NaN
1 Africa Eastern and Southern 1672.505957
2 Afghanistan NaN
3 Africa Western and Central 1584.333285
4 Angola 2309.521620
Country 2023
0 Aruba NaN
1 Africa Eastern and Southern 4355.819919
2 Afghanistan NaN
3 Africa Western and Central 5239.787316
4 Angola 7310.000000
Rank 0
Country 0
Cost of Living Index 0
Rent Index 0
Cost of Living Plus Rent Index 0
Groceries Index 0
Restaurant Price Index 0
Local Purchasing Power Index 0
dtype: int64
Country 0
2023 31
dtype: int64
Country 0
2023 37
dtype: int64
Rank Country Cost of Living Index Rent Index \
0 53 Albania 42.1 10.6
1 101 Algeria 28.9 3.8
2 98 Argentina 29.4 7.6
3 58 Armenia 41.0 19.0
4 10 Australia 70.2 33.4
Cost of Living Plus Rent Index Groceries Index Restaurant Price Index \
0 27.0 42.0 35.7
1 16.9 36.8 14.0
2 18.9 29.7 24.8
3 30.5 36.0 38.9
4 52.5 77.3 62.5
Local Purchasing Power Index 2023_x 2023_y
0 39.9 8367.775731 21110.0
1 29.9 5260.206250 16790.0
2 41.5 13730.514710 28710.0
3 38.5 8715.765336 22440.0
4 127.4 64711.765600 66260.0
Rank 0
Country 0
Cost of Living Index 0
Rent Index 0
Cost of Living Plus Rent Index 0
Groceries Index 0
Restaurant Price Index 0
Local Purchasing Power Index 0
2023_x 2
2023_y 3
dtype: int64
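An inner merge silently drops countries whose names differ between sources. As a hedged sketch, the code below uses the `indicator` option of `DataFrame.merge` to list unmatched countries before they are dropped. The small frames, the placeholder values, and the name variants in them are illustrative assumptions, not the contents of the actual files.

```python
import pandas as pd

# Illustrative frames with placeholder values; the real data come from the CSV files
cost_demo = pd.DataFrame({
    "Country": ["Albania", "South Korea", "Australia"],
    "Cost of Living Index": [42.1, 60.0, 70.2],
})
gdp_demo = pd.DataFrame({
    "Country": ["Albania", "Korea, Rep.", "Australia"],
    "2023": [8367.8, 33000.0, 64711.8],
})

# A left merge with indicator=True adds a '_merge' column showing where each row matched
check = cost_demo.merge(gdp_demo, on="Country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country"].tolist()
print("Countries without a GDP match:", unmatched)

# A hypothetical rename map can align the names before the real merge
name_fixes = {"Korea, Rep.": "South Korea"}
gdp_demo["Country"] = gdp_demo["Country"].replace(name_fixes)
merged_demo = cost_demo.merge(gdp_demo, on="Country", how="inner")
print(merged_demo)
```

Running the same check against the GDP and GNI frames would show how many countries the inner merges discard and whether a rename map is worth building.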
# Rename columns for clarity
merged_df.rename(columns={'2023_x': 'GDP_2023', '2023_y': 'GNI_2023'}, inplace=True)
# Handle missing data by dropping rows with NaNs
# Alternatively, you could use imputation methods here if you want to retain more data
merged_df.dropna(subset=['GDP_2023', 'GNI_2023'], inplace=True)
# Calculate Economic Efficiency Metrics
merged_df['Cost_of_Living_Efficiency_Ratio'] = merged_df['Cost of Living Index'] / merged_df['Local Purchasing Power Index']
merged_df['Income_Efficiency_Ratio'] = merged_df['GDP_2023'] / merged_df['GNI_2023']
# Display the updated DataFrame with new metrics
print(merged_df.head())
# Check for any remaining missing values in key columns
print(merged_df.isnull().sum())
Rank Country Cost of Living Index Rent Index \
0 53 Albania 42.1 10.6
1 101 Algeria 28.9 3.8
2 98 Argentina 29.4 7.6
3 58 Armenia 41.0 19.0
4 10 Australia 70.2 33.4
Cost of Living Plus Rent Index Groceries Index Restaurant Price Index \
0 27.0 42.0 35.7
1 16.9 36.8 14.0
2 18.9 29.7 24.8
3 30.5 36.0 38.9
4 52.5 77.3 62.5
Local Purchasing Power Index GDP_2023 GNI_2023 \
0 39.9 8367.775731 21110.0
1 29.9 5260.206250 16790.0
2 41.5 13730.514710 28710.0
3 38.5 8715.765336 22440.0
4 127.4 64711.765600 66260.0
Cost_of_Living_Efficiency_Ratio Income_Efficiency_Ratio
0 1.055138 0.396389
1 0.966555 0.313294
2 0.708434 0.478249
3 1.064935 0.388403
4 0.551020 0.976634
Rank 0
Country 0
Cost of Living Index 0
Rent Index 0
Cost of Living Plus Rent Index 0
Groceries Index 0
Restaurant Price Index 0
Local Purchasing Power Index 0
GDP_2023 0
GNI_2023 0
Cost_of_Living_Efficiency_Ratio 0
Income_Efficiency_Ratio 0
dtype: int64
The Cost of Living Efficiency Ratio and the Income Efficiency Ratio are the two metrics used here to compare countries. The Cost of Living Efficiency Ratio divides the Cost of Living Index by the Local Purchasing Power Index: values below 1.0 mean that living costs are low relative to purchasing power, and values above 1.0 mean the opposite. The Income Efficiency Ratio divides GDP per capita (current US$) by GNI per capita (PPP, current international $). Because the numerator is in nominal dollars and the denominator is PPP-adjusted, this ratio also reflects differences in national price levels, which should be kept in mind when reading it as a measure of efficiency. Together, these metrics provide a comparative measure across countries, helping identify those that manage their economic resources more effectively than others (Mankiw, 2020).
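As a worked example, the two ratios can be reproduced for Albania from the values shown in the merged output. The function name `efficiency_ratios` is a hypothetical helper introduced for illustration.

```python
def efficiency_ratios(cost_of_living, purchasing_power, gdp_per_capita, gni_per_capita_ppp):
    """Return (Cost of Living Efficiency Ratio, Income Efficiency Ratio) for one country."""
    cost_ratio = cost_of_living / purchasing_power
    income_ratio = gdp_per_capita / gni_per_capita_ppp
    return cost_ratio, income_ratio


# Albania's row from the merged data: indices relative to NYC, GDP in US$, GNI in PPP dollars
cost_ratio, income_ratio = efficiency_ratios(42.1, 39.9, 8367.775731, 21110.0)
print(f"Cost of Living Efficiency Ratio: {cost_ratio:.6f}")
print(f"Income Efficiency Ratio: {income_ratio:.6f}")
```

These values agree with the 1.055138 and 0.396389 shown for Albania in the DataFrame output.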
11.2.1 Descriptive Statistics¶
Calculate the mean, median, and standard deviation of the Cost_of_Living_Efficiency_Ratio and Income_Efficiency_Ratio.
# Calculate descriptive statistics for Cost_of_Living_Efficiency_Ratio and Income_Efficiency_Ratio
descriptive_stats = merged_df[['Cost_of_Living_Efficiency_Ratio', 'Income_Efficiency_Ratio']].describe()
print(descriptive_stats)
Cost_of_Living_Efficiency_Ratio Income_Efficiency_Ratio
count 100.000000 100.000000
mean 0.815309 0.537753
std 0.484284 0.242261
min 0.256659 0.230282
25% 0.553827 0.352719
50% 0.684850 0.472637
75% 0.883646 0.657206
max 3.552381 1.302258
11.2.2 Identify Top and Bottom Countries¶
# Top 5 most efficient countries based on Cost_of_Living_Efficiency_Ratio
top_5_cost_efficiency = merged_df[['Country', 'Cost_of_Living_Efficiency_Ratio']].sort_values(by='Cost_of_Living_Efficiency_Ratio', ascending=True).head(5)
print("Top 5 Most Cost Efficient Countries:\n", top_5_cost_efficiency)
# Bottom 5 least efficient countries based on Cost_of_Living_Efficiency_Ratio
bottom_5_cost_efficiency = merged_df[['Country', 'Cost_of_Living_Efficiency_Ratio']].sort_values(by='Cost_of_Living_Efficiency_Ratio', ascending=False).head(5)
print("Bottom 5 Least Cost Efficient Countries:\n", bottom_5_cost_efficiency)
# Top 5 most efficient countries based on Income_Efficiency_Ratio
top_5_income_efficiency = merged_df[['Country', 'Income_Efficiency_Ratio']].sort_values(by='Income_Efficiency_Ratio', ascending=True).head(5)
print("Top 5 Most Income Efficient Countries:\n", top_5_income_efficiency)
# Bottom 5 least efficient countries based on Income_Efficiency_Ratio
bottom_5_income_efficiency = merged_df[['Country', 'Income_Efficiency_Ratio']].sort_values(by='Income_Efficiency_Ratio', ascending=False).head(5)
print("Bottom 5 Least Income Efficient Countries:\n", bottom_5_income_efficiency)
Top 5 Most Cost Efficient Countries:
Country Cost_of_Living_Efficiency_Ratio
42 India 0.256659
54 Kuwait 0.260198
74 Oman 0.303290
88 South Africa 0.335603
85 Saudi Arabia 0.335804
Bottom 5 Least Cost Efficient Countries:
Country Cost_of_Living_Efficiency_Ratio
17 Cameroon 3.552381
71 Nigeria 2.854545
90 Sri Lanka 1.977143
9 Barbados 1.760920
37 Ghana 1.679348
Top 5 Most Income Efficient Countries:
Country Income_Efficiency_Ratio
75 Pakistan 0.230282
42 India 0.247741
68 Nepal 0.252678
101 Uzbekistan 0.258932
71 Nigeria 0.261472
Bottom 5 Least Income Efficient Countries:
Country Income_Efficiency_Ratio
59 Luxembourg 1.302258
9 Barbados 1.222243
82 Puerto Rico 1.114855
92 Switzerland 1.110068
45 Ireland 1.051038
The analysis of economic efficiency across countries reveals large disparities in both ratios. The five most cost-efficient countries (India, Kuwait, Oman, South Africa, and Saudi Arabia) have Cost of Living Efficiency Ratios between 0.26 and 0.34, indicating that living costs are low relative to local purchasing power. The least cost-efficient countries (Cameroon, Nigeria, Sri Lanka, Barbados, and Ghana) have ratios between 1.68 and 3.55, suggesting that living costs far exceed residents' purchasing power and pointing to possible economic challenges or inefficiencies.
On the Income Efficiency Ratio, the analysis ranks the lowest values as most efficient: Pakistan, India, Nepal, Uzbekistan, and Nigeria have ratios between 0.23 and 0.26. Luxembourg, Barbados, Puerto Rico, Switzerland, and Ireland have the highest ratios, all above 1.0, and are ranked least efficient. Because the ratio divides nominal GDP per capita by PPP-adjusted GNI per capita, low values are typical of countries with low price levels and high values of countries with high price levels, so this ranking should be interpreted with caution. The comparison nonetheless highlights how widely the two measures vary across regions and economic environments.
11.2.3 Visual Analysis¶
Histograms for Efficiency Ratios¶
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6))
# Histogram for Cost_of_Living_Efficiency_Ratio
plt.subplot(1, 2, 1)
sns.histplot(merged_df['Cost_of_Living_Efficiency_Ratio'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Cost of Living Efficiency Ratio')
plt.xlabel('Cost of Living Efficiency Ratio')
plt.ylabel('Frequency')
# Histogram for Income_Efficiency_Ratio
plt.subplot(1, 2, 2)
sns.histplot(merged_df['Income_Efficiency_Ratio'], bins=20, kde=True, color='lightgreen')
plt.title('Distribution of Income Efficiency Ratio')
plt.xlabel('Income Efficiency Ratio')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Box Plots to Identify Outliers¶
plt.figure(figsize=(12, 6))
# Box plot for Cost_of_Living_Efficiency_Ratio
plt.subplot(1, 2, 1)
sns.boxplot(y=merged_df['Cost_of_Living_Efficiency_Ratio'], color='skyblue')
plt.title('Box Plot of Cost of Living Efficiency Ratio')
plt.ylabel('Cost of Living Efficiency Ratio')  # the ratio is plotted on the y-axis
# Box plot for Income_Efficiency_Ratio
plt.subplot(1, 2, 2)
sns.boxplot(y=merged_df['Income_Efficiency_Ratio'], color='lightgreen')
plt.title('Box Plot of Income Efficiency Ratio')
plt.ylabel('Income Efficiency Ratio')  # the ratio is plotted on the y-axis
plt.tight_layout()
plt.show()
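Default box plots flag points lying more than 1.5 × IQR beyond the quartiles, so the outlier thresholds can also be computed directly. The sketch below does so for the Cost of Living Efficiency Ratio, using the quartiles from the descriptive statistics and the ratios of the five least cost-efficient countries listed earlier; the variable names are illustrative.

```python
# Quartiles of Cost_of_Living_Efficiency_Ratio from the descriptive statistics
q1, q3 = 0.553827, 0.883646
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers in a default box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Lower fence: {lower_fence:.4f}, upper fence: {upper_fence:.4f}")

# Ratios of the five least cost-efficient countries, as listed earlier
bottom_five = {
    "Cameroon": 3.552381,
    "Nigeria": 2.854545,
    "Sri Lanka": 1.977143,
    "Barbados": 1.760920,
    "Ghana": 1.679348,
}
outliers = [country for country, ratio in bottom_five.items() if ratio > upper_fence]
print("Flagged as high outliers:", outliers)
```

All five countries lie above the upper fence, so each would appear as an individual point above the whisker in the box plot.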
Regional Comparison¶
# Define a mapping of countries to their regions
country_to_region = {
"United States": "North America", "Canada": "North America", "Mexico": "North America",
"Brazil": "South America", "Argentina": "South America", "Chile": "South America",
"Germany": "Europe", "France": "Europe", "United Kingdom": "Europe", "Italy": "Europe",
"Spain": "Europe", "China": "Asia", "India": "Asia", "Japan": "Asia", "South Korea": "Asia",
"Russia": "Europe", "Australia": "Oceania", "New Zealand": "Oceania", "South Africa": "Africa",
"Nigeria": "Africa", "Egypt": "Africa", "Morocco": "Africa", "Saudi Arabia": "Middle East",
"United Arab Emirates": "Middle East", "Israel": "Middle East", "Turkey": "Middle East",
"Indonesia": "Asia", "Vietnam": "Asia", "Philippines": "Asia", "Thailand": "Asia",
"Malaysia": "Asia", "Singapore": "Asia", "Switzerland": "Europe", "Austria": "Europe",
"Netherlands": "Europe", "Belgium": "Europe", "Sweden": "Europe", "Norway": "Europe",
"Denmark": "Europe", "Finland": "Europe", "Ireland": "Europe", "Poland": "Europe",
"Czech Republic": "Europe", "Hungary": "Europe", "Portugal": "Europe", "Greece": "Europe",
"Ukraine": "Europe", "Belarus": "Europe", "Kazakhstan": "Asia", "Uzbekistan": "Asia",
"Kyrgyzstan": "Asia", "Armenia": "Asia", "Azerbaijan": "Asia", "Georgia": "Europe",
"Sri Lanka": "Asia", "Bangladesh": "Asia", "Pakistan": "Asia", "Nepal": "Asia", "Bhutan": "Asia",
"Maldives": "Asia", "Myanmar": "Asia", "Cambodia": "Asia", "Laos": "Asia", "Brunei": "Asia",
"Papua New Guinea": "Oceania", "Fiji": "Oceania", "New Caledonia": "Oceania", "Solomon Islands": "Oceania",
"Vanuatu": "Oceania", "Tonga": "Oceania", "Samoa": "Oceania", "Iceland": "Europe", "Greenland": "North America",
"Jamaica": "North America", "Cuba": "North America", "Dominican Republic": "North America", "Haiti": "North America",
"Trinidad And Tobago": "North America", "Panama": "North America", "Costa Rica": "North America", "El Salvador": "North America",
"Honduras": "North America", "Guatemala": "North America", "Nicaragua": "North America", "Venezuela": "South America",
"Colombia": "South America", "Peru": "South America", "Ecuador": "South America", "Bolivia": "South America", "Paraguay": "South America",
"Uruguay": "South America", "Suriname": "South America", "Guyana": "South America", "French Guiana": "South America",
"Libya": "Africa", "Tunisia": "Africa", "Algeria": "Africa", "Mali": "Africa", "Niger": "Africa",
"Chad": "Africa", "Sudan": "Africa", "South Sudan": "Africa", "Ethiopia": "Africa", "Somalia": "Africa", "Kenya": "Africa",
"Uganda": "Africa", "Tanzania": "Africa", "Rwanda": "Africa", "Burundi": "Africa", "Democratic Republic of the Congo": "Africa",
"Republic of the Congo": "Africa", "Angola": "Africa", "Zambia": "Africa", "Malawi": "Africa", "Mozambique": "Africa", "Zimbabwe": "Africa",
"Botswana": "Africa", "Namibia": "Africa", "Lesotho": "Africa", "Eswatini": "Africa", "Madagascar": "Africa", "Mauritius": "Africa",
"Comoros": "Africa", "Seychelles": "Africa", "Djibouti": "Africa", "Eritrea": "Africa", "Yemen": "Middle East", "Oman": "Middle East",
"Qatar": "Middle East", "Bahrain": "Middle East", "Kuwait": "Middle East", "Iran": "Middle East", "Iraq": "Middle East",
"Syria": "Middle East", "Lebanon": "Middle East", "Jordan": "Middle East", "Palestine": "Middle East", "Afghanistan": "Asia",
"Tajikistan": "Asia", "Turkmenistan": "Asia", "Mongolia": "Asia", "Albania": "Europe", "Barbados": "North America", "Bulgaria": "Europe",
"Cameroon": "Africa", "Croatia": "Europe", "Cyprus": "Asia", "Czechia": "Europe", "Estonia": "Europe", "Ghana": "Africa",
"Kosovo": "Europe", "Latvia": "Europe", "Lithuania": "Europe", "Luxembourg": "Europe", "Malta": "Europe", "Moldova": "Europe",
"Montenegro": "Europe", "North Macedonia": "Europe", "Puerto Rico": "North America", "Romania": "Europe", "Slovenia": "Europe",
"United States of America": "North America"
}
# Apply the region mapping to create a new column
merged_df['Region'] = merged_df['Country'].map(country_to_region)
# Display the first few rows to verify the Region column
print(merged_df[['Country', 'Region']].head())
# lmplot creates its own figure, so size it with height/aspect rather than plt.figure()
sns.lmplot(x='Income_Efficiency_Ratio', y='Cost_of_Living_Efficiency_Ratio', hue='Region', data=merged_df, palette='pastel', height=6, aspect=1.5)
plt.title('Income Efficiency Ratio vs Cost of Living Efficiency Ratio by Region')
plt.xlabel('Income Efficiency Ratio')
plt.ylabel('Cost of Living Efficiency Ratio')
plt.show()
g = sns.FacetGrid(merged_df, col="Region", col_wrap=4, height=4, aspect=1.5)
g.map(sns.histplot, "Cost_of_Living_Efficiency_Ratio", kde=True)
g.set_titles("{col_name}")
# histplot's default statistic is the count per bin, so label the y-axis accordingly
g.set_axis_labels("Cost of Living Efficiency Ratio", "Count")
g.fig.suptitle('Cost of Living Efficiency Ratio by Region', fontsize=16)
g.fig.subplots_adjust(top=0.9)
plt.show()
     Country         Region
0    Albania         Europe
1    Algeria         Africa
2  Argentina  South America
3    Armenia           Asia
4  Australia        Oceania
The visual analysis conducted in this section provides a detailed examination of the distributions and relationships between economic efficiency metrics across different regions. As seen in the histograms and box plots (Figures 1 and 2), key patterns and potential outliers in both the Cost of Living Efficiency Ratio and Income Efficiency Ratio have been identified. The scatter plot (Figure 3) illustrates the relationship between these ratios, further emphasizing the differences across regions. Additionally, the regional comparison (Figure 4) highlights distinct variations in efficiency metrics across continents, showcasing where certain regions excel or lag behind. These visual insights, captured in the figures, are crucial for understanding the broader economic dynamics and guiding further in-depth analysis.
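Because the Region column comes from a dictionary lookup, any country whose name is missing from the mapping silently receives a missing Region and drops out of every per-region plot and test. The following is a minimal sketch of a check for that case. The small dictionary and the three-row frame are toy stand-ins for `country_to_region` and `merged_df`, and the unmapped name is a hypothetical example.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (illustrative values only).
country_to_region = {"Albania": "Europe", "Algeria": "Africa"}
merged_df = pd.DataFrame({"Country": ["Albania", "Algeria", "Hong Kong (China)"]})

merged_df["Region"] = merged_df["Country"].map(country_to_region)

# Countries with no entry in the dictionary end up with a missing Region.
unmapped = merged_df.loc[merged_df["Region"].isna(), "Country"].tolist()
print(f"{len(unmapped)} unmapped: {unmapped}")
```

Running the same two lines against the real `merged_df` would show whether any countries need an extra dictionary entry before the regional comparison.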
The comparison of economic efficiency across different regions aims to uncover significant disparities in how countries utilize their economic resources to enhance the living standards of their citizens. By analyzing metrics like the Cost of Living Efficiency Ratio and Income Efficiency Ratio, we can identify regional strengths and weaknesses, revealing patterns of economic performance that vary from one region to another. This comparative analysis not only highlights regions that excel in converting resources into prosperity but also points to areas where improvements are needed, offering valuable insights for policymakers and economists alike (Smith & Todd, 2018).
11.3.1 Descriptive Statistics by Region¶
# Group data by region and calculate descriptive statistics
region_grouped = merged_df.groupby('Region').agg({
'Cost_of_Living_Efficiency_Ratio': ['mean', 'median', 'std'],
'Income_Efficiency_Ratio': ['mean', 'median', 'std']
}).reset_index()
# Rename columns for clarity
region_grouped.columns = ['Region', 'Mean_Cost_of_Living_Efficiency_Ratio', 'Median_Cost_of_Living_Efficiency_Ratio', 'Std_Cost_of_Living_Efficiency_Ratio',
'Mean_Income_Efficiency_Ratio', 'Median_Income_Efficiency_Ratio', 'Std_Income_Efficiency_Ratio']
# Display the descriptive statistics
display(region_grouped)
| | Region | Mean_Cost_of_Living_Efficiency_Ratio | Median_Cost_of_Living_Efficiency_Ratio | Std_Cost_of_Living_Efficiency_Ratio | Mean_Income_Efficiency_Ratio | Median_Income_Efficiency_Ratio | Std_Income_Efficiency_Ratio |
|---|---|---|---|---|---|---|---|
| 0 | Africa | 1.310780 | 0.924798 | 0.922738 | 0.339172 | 0.325602 | 0.047455 |
| 1 | Asia | 0.815345 | 0.737639 | 0.416209 | 0.376276 | 0.319268 | 0.153834 |
| 2 | Europe | 0.640414 | 0.631692 | 0.141441 | 0.631109 | 0.594190 | 0.241886 |
| 3 | Middle East | 0.493912 | 0.467187 | 0.212158 | 0.570338 | 0.538383 | 0.182087 |
| 4 | North America | 1.085537 | 1.124638 | 0.410026 | 0.711761 | 0.615968 | 0.290521 |
| 5 | Oceania | 0.535164 | 0.533884 | 0.015256 | 0.771916 | 0.919959 | 0.306812 |
| 6 | South America | 0.812312 | 0.811828 | 0.082260 | 0.466122 | 0.478249 | 0.118277 |
11.3.2 Visualization¶
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Scatter plot with regression line for Cost_of_Living_Efficiency_Ratio by Region using lmplot
sns.lmplot(x='Cost_of_Living_Efficiency_Ratio', y='Income_Efficiency_Ratio', data=merged_df, hue='Region', palette='pastel', height=6, aspect=1.5)
plt.title('Cost of Living Efficiency Ratio vs Income Efficiency Ratio by Region')
plt.xlabel('Cost of Living Efficiency Ratio')
plt.ylabel('Income Efficiency Ratio')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Heatmap to visualize the relationship between regions and average efficiency ratios with lighter colors
plt.figure(figsize=(10, 8))
heatmap_data = merged_df.groupby('Region')[['Cost_of_Living_Efficiency_Ratio', 'Income_Efficiency_Ratio']].mean().T
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Heatmap of Average Efficiency Ratios by Region')
plt.xlabel('Region')
plt.ylabel('Efficiency Metrics')
plt.tight_layout()
plt.show()
# Pairwise relationships between efficiency ratios by region using a pair grid
g = sns.pairplot(merged_df, hue='Region', palette='bright', height=2.5, aspect=1.2,
vars=['Cost_of_Living_Efficiency_Ratio', 'Income_Efficiency_Ratio'])
g.fig.suptitle('Pairwise Relationships of Efficiency Ratios by Region', y=1.02)
plt.show()
The visual analysis presented here provides a comprehensive comparison of economic efficiency across different regions. The scatter plot with regression lines offers a clear view of the relationship between Cost of Living Efficiency Ratio and Income Efficiency Ratio, highlighting regional trends and differences. The heatmap illustrates the average efficiency ratios by region, making it easy to identify regions that perform better or worse in economic efficiency. The pairwise relationship plot further explores the distribution and correlation between these efficiency metrics, revealing clusters and outliers that warrant further investigation. These visuals collectively offer valuable insights into how different regions manage economic resources and provide a solid foundation for further comparative analysis.
The descriptive statistics reveal substantial regional variation in both the Cost of Living Efficiency Ratio and the Income Efficiency Ratio. Africa has the highest mean Cost of Living Efficiency Ratio (1.31), indicating that living costs are, on average, high relative to local purchasing power. Its large standard deviation (0.92) and much lower median (0.92) show that a few countries with very high ratios pull the mean upward. North America has the second-highest mean (1.09). The Middle East and Oceania have the lowest mean ratios (0.49 and 0.54, respectively), indicating the best cost efficiency, followed by Europe (0.64). Europe's relatively low standard deviation (0.14) implies fairly consistent cost efficiency across its countries, and Oceania's standard deviation (0.02) is the smallest of any region. Regarding income efficiency, Oceania leads with the highest mean Income Efficiency Ratio (0.77), which this analysis interprets as strong performance in converting income into economic output. North America follows with a mean ratio of 0.71, and Europe with 0.63. Africa shows the lowest mean Income Efficiency Ratio (0.34), so the region combines the highest cost ratio with the lowest income ratio. These regional disparities underscore the importance of economic policies tailored to each region's specific challenges.
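When regional means and medians are compared like this, it helps to know how many countries sit behind each figure and how far the mean is from the median. The sketch below uses pandas named aggregation on a toy frame to add both. The region labels and ratio values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data: region "A" contains one very high ratio, region "B" does not.
df = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B"],
    "Cost_of_Living_Efficiency_Ratio": [0.5, 0.6, 2.5, 0.8, 0.9],
})

# Named aggregation yields flat, readable column names in one step.
summary = df.groupby("Region").agg(
    n_countries=("Cost_of_Living_Efficiency_Ratio", "size"),
    mean_ratio=("Cost_of_Living_Efficiency_Ratio", "mean"),
    median_ratio=("Cost_of_Living_Efficiency_Ratio", "median"),
)

# A large positive gap signals that a few high values dominate the mean.
summary["mean_minus_median"] = summary["mean_ratio"] - summary["median_ratio"]
print(summary)
```

A region with few countries or a large mean-median gap is one where the regional mean should be read with caution.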
11.3.3 ANOVA Test¶
from scipy import stats
# Perform ANOVA to test if there's a significant difference in the Cost of Living Efficiency Ratio across regions
anova_result_col = stats.f_oneway(*[group['Cost_of_Living_Efficiency_Ratio'].values for name, group in merged_df.groupby('Region')])
print(f"ANOVA Result for Cost of Living Efficiency Ratio across Regions: F-statistic={anova_result_col.statistic}, p-value={anova_result_col.pvalue}")
# Perform ANOVA to test if there's a significant difference in the Income Efficiency Ratio across regions
anova_result_ier = stats.f_oneway(*[group['Income_Efficiency_Ratio'].values for name, group in merged_df.groupby('Region')])
print(f"ANOVA Result for Income Efficiency Ratio across Regions: F-statistic={anova_result_ier.statistic}, p-value={anova_result_ier.pvalue}")
ANOVA Result for Cost of Living Efficiency Ratio across Regions: F-statistic=5.9413267725876375, p-value=2.763501194171203e-05
ANOVA Result for Income Efficiency Ratio across Regions: F-statistic=7.508562713436847, p-value=1.4092085055634125e-06
The ANOVA on the Cost of Living Efficiency Ratio and the Income Efficiency Ratio reveals statistically significant differences among the regions. For the Cost of Living Efficiency Ratio, the F-statistic is 5.94 with a p-value of 2.76e-05, indicating that mean cost efficiency is not the same in every region. Similarly, for the Income Efficiency Ratio, the F-statistic is 7.51 with a p-value of 1.41e-06, confirming that income efficiency also differs across regions. ANOVA shows only that at least one regional mean differs from the others; it does not identify which regions differ or why. These results therefore suggest that economic efficiency is associated with region, and that targeted regional strategies may be needed to address the disparities.
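A significant F-statistic says that regional means differ, but not how much of the variation region accounts for. One common complement is eta squared, the ratio of the between-group sum of squares to the total sum of squares. The helper below is a minimal sketch of that calculation; `eta_squared` and the toy groups are illustrative names and values, not part of the original analysis.

```python
import numpy as np

def eta_squared(groups):
    """Share of total variance explained by group membership (SS_between / SS_total)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Toy groups standing in for regions.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
eta2 = eta_squared(toy_groups)
print(f"eta squared = {eta2:.3f}")
```

With the notebook's data, the same list comprehension passed to `stats.f_oneway` could be passed to this helper instead.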
11.3.4 Kruskal-Wallis Test¶
# Perform Kruskal-Wallis test for Cost of Living Efficiency Ratio across regions
kruskal_result_col = stats.kruskal(*[group['Cost_of_Living_Efficiency_Ratio'].values for name, group in merged_df.groupby('Region')])
print(f"Kruskal-Wallis Result for Cost of Living Efficiency Ratio across Regions: H-statistic={kruskal_result_col.statistic}, p-value={kruskal_result_col.pvalue}")
# Perform Kruskal-Wallis test for Income Efficiency Ratio across regions
kruskal_result_ier = stats.kruskal(*[group['Income_Efficiency_Ratio'].values for name, group in merged_df.groupby('Region')])
print(f"Kruskal-Wallis Result for Income Efficiency Ratio across Regions: H-statistic={kruskal_result_ier.statistic}, p-value={kruskal_result_ier.pvalue}")
Kruskal-Wallis Result for Cost of Living Efficiency Ratio across Regions: H-statistic=29.61134079238036, p-value=4.6591839535872924e-05 Kruskal-Wallis Result for Income Efficiency Ratio across Regions: H-statistic=41.20994287845775, p-value=2.632579055668303e-07
The Kruskal-Wallis test results confirm significant differences in both the Cost of Living Efficiency Ratio and the Income Efficiency Ratio across regions. The p-values are very small, similar to those from the ANOVA tests. For the Cost of Living Efficiency Ratio, the H-statistic is 29.61 with a p-value of approximately $4.66 \times 10^{-5}$. For the Income Efficiency Ratio, the H-statistic is 41.21 with a p-value of approximately $2.63 \times 10^{-7}$. These results, consistent with the ANOVA findings, provide strong evidence that economic efficiency varies between regions, supporting the need for region-specific economic policies and interventions. Because the Kruskal-Wallis test is a non-parametric alternative to ANOVA, it does not rely on normally distributed data, so the agreement between the two tests indicates that the conclusion does not hinge on that ANOVA assumption.
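The ANOVA assumptions mentioned above can also be checked directly. The sketch below applies Levene's test for equal variances and the Shapiro-Wilk test for normality, both from `scipy.stats`, to synthetic groups generated with a fixed seed. The group names and distribution parameters are assumptions made for illustration, not properties of the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic groups standing in for regions: same mean, very different spread.
groups = {
    "tight": rng.normal(loc=0.6, scale=0.05, size=30),
    "medium": rng.normal(loc=0.6, scale=0.30, size=30),
    "wide": rng.normal(loc=0.6, scale=1.00, size=30),
}

# Levene's test (median-centered): H0 is that all groups share the same variance.
levene_result = stats.levene(*groups.values(), center="median")
print(f"Levene: W={levene_result.statistic:.2f}, p={levene_result.pvalue:.4g}")

# Shapiro-Wilk per group: H0 is that the group is drawn from a normal distribution.
for name, values in groups.items():
    shapiro_result = stats.shapiro(values)
    print(f"Shapiro-Wilk {name}: W={shapiro_result.statistic:.3f}, p={shapiro_result.pvalue:.3f}")
```

A small Levene p-value on the real regional groups would indicate unequal variances, which is one reason to lean on the Kruskal-Wallis result.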
Identifying outliers and exceptional performers in economic efficiency metrics is crucial for understanding the broader economic landscape. Outliers can signal countries that are either significantly more or less efficient than their peers, offering insights into the effectiveness of their economic policies and practices. In this section, we will use statistical methods such as Z-score calculations and visual tools like box plots to detect these anomalies. Countries that consistently perform well or poorly in terms of the Cost of Living Efficiency Ratio and Income Efficiency Ratio will be highlighted, as they may represent best practices or areas needing reform. This analysis is vital for policymakers and economists who aim to replicate successful strategies or address inefficiencies within different regions (Field, 2018).
11.4.1 Ridgeline Plot Analysis and Z-Score Calculation¶
import numpy as np
import joypy
# Calculate Z-scores for Cost_of_Living_Efficiency_Ratio
merged_df['Z_Score_Cost_of_Living'] = (merged_df['Cost_of_Living_Efficiency_Ratio'] - merged_df['Cost_of_Living_Efficiency_Ratio'].mean()) / merged_df['Cost_of_Living_Efficiency_Ratio'].std()
# Calculate Z-scores for Income_Efficiency_Ratio
merged_df['Z_Score_Income'] = (merged_df['Income_Efficiency_Ratio'] - merged_df['Income_Efficiency_Ratio'].mean()) / merged_df['Income_Efficiency_Ratio'].std()
# Identify outliers (Z-scores greater than 3 or less than -3)
outliers_cost_of_living = merged_df[np.abs(merged_df['Z_Score_Cost_of_Living']) > 3]
outliers_income = merged_df[np.abs(merged_df['Z_Score_Income']) > 3]
# Display the outliers
print("Outliers based on Cost of Living Efficiency Ratio Z-Scores:")
print(outliers_cost_of_living[['Country', 'Z_Score_Cost_of_Living']])
print("\nOutliers based on Income Efficiency Ratio Z-Scores:")
print(outliers_income[['Country', 'Z_Score_Income']])
# Ridgeline plot for Cost_of_Living_Efficiency_Ratio by Region
# joyplot creates its own figure, so pass figsize and title to it instead of calling plt.figure()
fig, axes = joypy.joyplot(data=merged_df, by='Region', column='Cost_of_Living_Efficiency_Ratio',
                          ylim='own', colormap=plt.cm.Pastel1, fade=True, figsize=(12, 8),
                          title='Ridgeline Plot of Cost of Living Efficiency Ratio by Region')
plt.xlabel('Cost of Living Efficiency Ratio')
plt.show()
# Ridgeline plot for Income_Efficiency_Ratio by Region
fig, axes = joypy.joyplot(data=merged_df, by='Region', column='Income_Efficiency_Ratio',
                          ylim='own', colormap=plt.cm.Pastel2, fade=True, figsize=(12, 8),
                          title='Ridgeline Plot of Income Efficiency Ratio by Region')
plt.xlabel('Income Efficiency Ratio')
plt.show()
Outliers based on Cost of Living Efficiency Ratio Z-Scores:
Country Z_Score_Cost_of_Living
17 Cameroon 5.651787
71 Nigeria 4.210824
Outliers based on Income Efficiency Ratio Z-Scores:
Country Z_Score_Income
59 Luxembourg 3.155711
The ridgeline plots show how the Cost of Living Efficiency Ratio and Income Efficiency Ratio are distributed within each region, revealing different density patterns and pointing to potential outliers. To pinpoint these outliers, Z-scores were calculated for both efficiency ratios. Each Z-score measures how many standard deviations a country's ratio lies from the mean of the full sample, and countries with an absolute Z-score above 3 are flagged. Cameroon and Nigeria emerged as outliers in the Cost of Living Efficiency Ratio, with Z-scores of 5.65 and 4.21, respectively. For the Income Efficiency Ratio, Luxembourg was identified as an outlier with a Z-score of 3.16. These countries are flagged for further investigation to understand the factors driving their substantial deviation from the sample mean.
In-Depth Analysis of Outliers¶
Following the identification of outliers through Z-score analysis, the next step is to investigate these exceptional cases and understand what drives their deviation from the rest of the sample. The outliers are Cameroon and Nigeria, with Z-scores of 5.65 and 4.21 respectively for the Cost of Living Efficiency Ratio, indicating far higher cost inefficiency than the sample average. Luxembourg also stands out with a Z-score of 3.16 for the Income Efficiency Ratio, indicating notably higher income efficiency. The investigation involves detailed case studies of these countries, examining their economic policies, resource allocation, and other factors that may explain their distinct efficiency ratios. Their economic indicators will also be compared against regional averages to pinpoint where they diverge most. The investigation will then consider the policy implications for these countries, exploring economic strategies that could improve their efficiency ratios. If necessary, further statistical tests will be conducted to check whether the identified outliers reflect genuine differences or random variation.
Visual tools, such as detailed line charts or bar charts, will be used to illustrate how these outliers differ from the mean performance within their regions. By incorporating these figures and applying a rigorous analytical framework, the analysis not only identifies these countries as outliers but also provides a statistical basis for understanding their unique economic challenges and opportunities. This approach aims to offer insights that can inform policy decisions and economic strategies, ultimately contributing to a more nuanced understanding of global economic efficiency.
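The Z-scores computed earlier measure distance from the full-sample mean, whereas the comparison described here is against regional averages. A minimal sketch of a within-region Z-score, using `groupby(...).transform`, is shown below alongside the global version. The countries, regions, and ratios are toy values, and `Z_Within_Region` and `Z_Global` are hypothetical column names.

```python
import pandas as pd

col = "Cost_of_Living_Efficiency_Ratio"

# Toy data: region "A" contains one country far above its regional peers.
df = pd.DataFrame({
    "Country": ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"],
    "Region": ["A"] * 4 + ["B"] * 4,
    col: [0.9, 1.0, 1.1, 3.0, 0.4, 0.5, 0.5, 0.6],
})

# transform() returns one value per row, aligned with the original frame.
by_region = df.groupby("Region")[col]
df["Z_Within_Region"] = (df[col] - by_region.transform("mean")) / by_region.transform("std")

# Global Z-score, as in the notebook, for comparison.
df["Z_Global"] = (df[col] - df[col].mean()) / df[col].std()

print(df.sort_values("Z_Within_Region", ascending=False).head(3))
```

One caveat on thresholds: with the sample standard deviation, the largest possible absolute Z-score in a group of n values is (n - 1) / sqrt(n), so a cutoff of 3 cannot flag any country in a region with fewer than 11 members. A lower cutoff or a rank-based rule may suit small regions better.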
The analysis of economic efficiency across regions has yielded significant insights into the disparities and challenges faced by different countries. The identification of outliers such as Cameroon, Nigeria, and Luxembourg underscores the variability in how nations manage their economic resources to maintain living standards. The high Cost of Living Efficiency Ratios observed in Cameroon and Nigeria suggest difficulty in balancing living costs with local purchasing power, likely reflecting broader economic challenges such as inflationary pressures and inadequate income growth. Luxembourg’s high Income Efficiency Ratio, on the other hand, points to a distinct economic advantage, potentially driven by robust economic policies and high productivity. This is consistent with findings in the economic literature that link high income efficiency with effective governance and economic management (Mankiw, 2020).
The regional analysis further highlights significant differences in economic efficiency across continents. Africa's higher average Cost of Living Efficiency Ratio indicates a need for targeted economic policies that address the underlying causes of these inefficiencies. In contrast, Europe and the Middle East show lower Cost of Living Efficiency Ratios, which this analysis reads as more effective management of living costs relative to purchasing power. These findings align with existing studies that emphasize the importance of regional economic policies and governance in shaping economic outcomes (Acemoglu & Robinson, 2012). The disparities suggest that while some regions have implemented policies that promote economic efficiency, others continue to struggle with the basic challenge of managing costs relative to income levels.
The broader implications of these findings suggest that policymakers in regions with high inefficiency should consider adopting tailored economic strategies that address their specific challenges. For example, countries with high Cost of Living Efficiency Ratios may benefit from policies aimed at controlling inflation and enhancing wage growth to improve living standards. Additionally, further research is warranted to explore the factors driving these regional disparities, which could provide valuable insights for developing effective policy interventions. Comparative studies focusing on multiple countries and regions could help identify best practices and areas for improvement, offering a path forward for nations seeking to enhance their economic efficiency. This approach is critical for ensuring that economic strategies are informed by data-driven insights, leading to more effective and sustainable economic policies (Sen, 1999).
The purpose of this section is to explore how different economic scenarios might impact the efficiency ratios and overall economic outcomes of various countries and regions. By employing scenario analysis and simulations, we can model potential future conditions and assess how changes in key economic variables might influence economic efficiency. Scenario analysis and simulations are powerful tools in economic analysis that allow us to test the sensitivity of economic outcomes to various assumptions and potential changes in the economic environment. This section will involve creating several hypothetical scenarios—such as changes in global inflation rates, shifts in trade policies, or significant economic shocks—and simulating their impacts on the Cost of Living Efficiency Ratio and Income Efficiency Ratio. By doing so, we can better understand the resilience of different economies and identify which regions or countries are most vulnerable to specific types of economic changes.
12.1: Defining Scenarios¶
Objective:¶
The goal of this step is to identify key economic variables that could have a significant impact on the efficiency ratios (Cost of Living Efficiency Ratio and Income Efficiency Ratio) and to develop a set of plausible scenarios that represent different potential future states. These scenarios will allow us to simulate and analyze how changes in these variables could affect economic efficiency across different regions and countries.
Key Economic Variables:¶
Inflation Rates: Inflation can directly affect the cost of living by increasing the prices of goods and services. High inflation rates across all regions could lead to a significant shift in the Cost of Living Efficiency Ratio, making it a crucial variable to consider in our analysis.
GDP Growth: GDP growth reflects the overall economic health of a country. A significant change in GDP growth, whether positive or negative, can influence both the Cost of Living and Income Efficiency Ratios. For instance, strong GDP growth may enhance income levels, thereby improving the Income Efficiency Ratio.
Income Levels: Changes in income levels directly impact the purchasing power of individuals. Scenarios involving significant increases or decreases in income levels across regions will help us understand how income disparities could affect economic efficiency.
Exchange Rates: Exchange rates influence the cost of imported goods and can affect the overall cost of living, particularly in countries that rely heavily on imports. Fluctuations in exchange rates can lead to significant changes in the Cost of Living Efficiency Ratio.
Plausible Scenarios:¶
Scenario 1: High Inflation Across All Regions¶
Description: In this scenario, we assume a consistent increase in inflation rates across all regions. This could be due to global economic factors such as increased demand for goods, supply chain disruptions, or expansionary monetary policies.
Expected Impact: This scenario is likely to lead to higher Cost of Living Efficiency Ratios as the prices of goods and services rise, potentially outpacing income growth in some regions.
Scenario 2: Significant Improvement in Global Trade Agreements¶
Description: Here, we assume that global trade agreements improve significantly, leading to lower tariffs and increased trade flows between countries.
Expected Impact: This could lower the cost of goods and services, especially in regions that rely on imports, thereby reducing (improving) the Cost of Living Efficiency Ratio. Additionally, increased trade could spur GDP growth and raise income levels, positively affecting the Income Efficiency Ratio.
Scenario 3: Economic Recession in Major Economies¶
Description: In this scenario, we assume a global economic recession, particularly in major economies such as the United States, China, and the European Union. This recession could lead to reduced GDP growth, lower income levels, and potentially deflationary pressures in some regions.
Expected Impact: The recession could lead to worsening Income Efficiency Ratios as incomes decline. However, the impact on the Cost of Living Efficiency Ratio may vary depending on how prices respond to the recession.
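One way to make the three scenarios comparable is to encode each as a pair of percentage shocks and run them through a single function. The sketch below does this for the Cost of Living Efficiency Ratio only. The shock sizes, the `SCENARIOS` dictionary, and the `apply_scenario` helper are all hypothetical, and representing trade and recession effects as shocks to the Local Purchasing Power Index is a simplifying assumption rather than the method used in this project.

```python
import pandas as pd

# Hypothetical shock sizes, chosen only to illustrate the mechanics.
SCENARIOS = {
    "high_inflation": {"cost_shock": 0.10, "purchasing_power_shock": 0.00},
    "improved_trade": {"cost_shock": -0.05, "purchasing_power_shock": 0.03},
    "recession": {"cost_shock": 0.00, "purchasing_power_shock": -0.08},
}

def apply_scenario(df, cost_shock, purchasing_power_shock):
    """Return a copy of df with the ratio recomputed under the given shocks."""
    out = df.copy()
    adjusted_cost = out["Cost of Living Index"] * (1 + cost_shock)
    adjusted_power = out["Local Purchasing Power Index"] * (1 + purchasing_power_shock)
    out["Adjusted_Cost_of_Living_Efficiency_Ratio"] = adjusted_cost / adjusted_power
    return out

# Toy baseline (illustrative values, not taken from the dataset).
baseline = pd.DataFrame(
    {"Cost of Living Index": [50.0, 30.0], "Local Purchasing Power Index": [100.0, 25.0]},
    index=pd.Index(["X", "Y"], name="Country"),
)

# One column per scenario, one row per country.
scenario_ratios = pd.DataFrame({
    name: apply_scenario(baseline, **shocks)["Adjusted_Cost_of_Living_Efficiency_Ratio"]
    for name, shocks in SCENARIOS.items()
})
print(scenario_ratios.round(3))
```

Keeping the scenario definitions in one dictionary makes it easy to add or adjust scenarios without touching the calculation.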
12.2: Setting Up Simulations¶
In this step, the objective is to establish simulation models that will enable an in-depth analysis of how different scenarios could impact economic efficiency ratios across various regions and countries. The primary focus is to understand how changes in critical economic variables—such as inflation rates, GDP growth, income levels, and exchange rates—might influence the Cost of Living Efficiency Ratio and Income Efficiency Ratio under each scenario. These simulations are crucial for anticipating the effects of potential future economic conditions on different regions' efficiency metrics.
The process begins with model specification, where mathematical relationships between key economic variables and the efficiency ratios are defined. For instance, one might model how an increase in inflation directly raises the Cost of Living Index or how GDP growth affects income levels and subsequently the Income Efficiency Ratio. This step also involves specifying assumptions, such as the linearity of relationships between variables, the constancy of certain factors, or parameters specific to particular regions. These assumptions form the backbone of the models, ensuring that the simulations are both realistic and relevant to the scenarios being analyzed.
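As a worked example of such a specification, the inflation scenario in Section 12.2.1 can be written in one line. The symbols are notation introduced here for illustration: $C_i$ is country $i$'s Cost of Living Index, $P_i$ its Local Purchasing Power Index, and $\pi$ the assumed inflation rate, with purchasing power held constant.

$$
R_i = \frac{C_i}{P_i}, \qquad R_i^{\text{adj}}(\pi) = \frac{C_i\,(1+\pi)}{P_i} = (1+\pi)\,R_i
$$

Under this linear assumption a 10% inflation rate raises every country's ratio by exactly 10%, so the ranking of countries is unchanged. A specification in which $\pi$ or $P_i$ responds differently by country would be needed to model differences in resilience.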
Next, input data is prepared, beginning with the use of the most recent data from the dataset as the baseline for the simulations. This data includes current inflation rates, GDP growth rates, income levels, and exchange rates for each region. For each scenario, key economic variables are adjusted according to the scenario’s assumptions. For example, in a scenario depicting high inflation, inflation rates would be uniformly increased across all regions or adjusted by specific percentages for each region.
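The region-specific adjustment mentioned above can be sketched with a dictionary of assumed rates mapped onto a Region column. The rates, the default value, and the toy baseline below are hypothetical and serve only to show the mechanics.

```python
import pandas as pd

# Hypothetical regional inflation assumptions (not estimates from any source).
regional_inflation = {"Europe": 0.06, "Africa": 0.12}
default_inflation = 0.10  # applied to regions without an explicit assumption

# Toy baseline (illustrative values only).
baseline = pd.DataFrame({
    "Country": ["X", "Y", "Z"],
    "Region": ["Europe", "Africa", "Oceania"],
    "Cost of Living Index": [50.0, 30.0, 70.0],
    "Local Purchasing Power Index": [100.0, 25.0, 140.0],
})

scenario = baseline.copy()
# Look up each country's rate by region, falling back to the default.
scenario["Inflation_Rate"] = scenario["Region"].map(regional_inflation).fillna(default_inflation)
scenario["Adjusted_Cost_of_Living_Efficiency_Ratio"] = (
    scenario["Cost of Living Index"] * (1 + scenario["Inflation_Rate"])
    / scenario["Local Purchasing Power Index"]
)
print(scenario[["Country", "Inflation_Rate", "Adjusted_Cost_of_Living_Efficiency_Ratio"]])
```

Unlike a uniform rate, regional rates change the relative position of countries, which is what makes cross-region comparisons under a scenario informative.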
The simulation design can take two primary forms: Monte Carlo simulation and deterministic simulation. A Monte Carlo simulation involves repeatedly running the model with varying inputs to account for uncertainties in economic variables, generating a distribution of possible outcomes rather than a single deterministic result. On the other hand, deterministic simulations focus on understanding specific outcomes by running the models with fixed inputs according to the scenario assumptions. Both methods provide valuable insights into how different economic conditions might influence efficiency ratios.
Once the simulations are set up, they are executed, and the models are run for each scenario. The output, including the predicted Cost of Living Efficiency Ratio and Income Efficiency Ratio for each region under the scenario conditions, is recorded. The results of each simulation, such as the mean, median, and standard deviation of the efficiency ratios, are carefully captured, and any extreme values or outliers that emerge are noted.
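The recording step described above can be sketched as follows: draw inflation rates with a seeded generator so the run is reproducible, compute every country's adjusted ratio for every draw, and summarize per country. The seed, the toy baseline, and the 5% to 15% range are assumptions for illustration, and the vectorized outer product is a swapped-in alternative to a Python loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the draws can be reproduced

# Toy baseline (illustrative values only).
baseline = pd.DataFrame({
    "Country": ["X", "Y", "Z"],
    "Cost of Living Index": [50.0, 30.0, 70.0],
    "Local Purchasing Power Index": [100.0, 25.0, 140.0],
})
base_ratio = (baseline["Cost of Living Index"] / baseline["Local Purchasing Power Index"]).to_numpy()

n_simulations = 1000
inflation_draws = rng.uniform(0.05, 0.15, size=n_simulations)

# Rows are simulations, columns are countries.
simulated = np.outer(1 + inflation_draws, base_ratio)
sim_df = pd.DataFrame(simulated, columns=baseline["Country"])

# Per-country summary of the simulated distribution.
summary = sim_df.agg(["mean", "median", "std"]).T
summary["p05"] = sim_df.quantile(0.05)
summary["p95"] = sim_df.quantile(0.95)
print(summary.round(4))
```

The 5th and 95th percentiles give a simple range for each country's ratio under the assumed inflation distribution.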
Finally, a sensitivity analysis is conducted to understand which variables have the most significant impact on the efficiency ratios. By varying one input at a time while keeping others constant, this analysis identifies the key drivers of economic efficiency under each scenario. Comparing scenarios helps highlight which scenarios pose the greatest risks or opportunities for different regions, providing valuable insights that can inform economic policy and strategy.
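A one-at-a-time sensitivity analysis of this kind can be sketched for two inputs. The `ratio` function, the `income_growth` input (modeled as a proportional increase in purchasing power), and the baseline values are hypothetical and stand in for whatever model specification is chosen.

```python
import numpy as np
import pandas as pd

def ratio(cost_index, purchasing_power, inflation=0.0, income_growth=0.0):
    """Cost of living efficiency ratio under simple proportional shocks."""
    return cost_index * (1 + inflation) / (purchasing_power * (1 + income_growth))

# Hypothetical baseline for a single country.
base = {"cost_index": 60.0, "purchasing_power": 80.0}
grid = np.linspace(0.0, 0.10, 11)  # 0% to 10% in 1-point steps

rows = []
for variable in ("inflation", "income_growth"):
    for change in grid:
        # Vary one input at a time; the other stays at its default of 0.
        value = ratio(**base, **{variable: float(change)})
        rows.append({"variable": variable, "change": round(float(change), 2), "ratio": value})
sensitivity = pd.DataFrame(rows)

# Range of the output over the tested grid, per input.
spread = sensitivity.groupby("variable")["ratio"].agg(lambda s: s.max() - s.min())
print(spread.round(4))
```

The input with the larger spread is the one the ratio is more sensitive to over the tested range.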
12.2.1 Simulation and Sensitivity Analysis of Economic Efficiency Ratios under Varying Inflation Scenarios¶
Scenario 1: High Inflation Across All Regions¶
import pandas as pd
import numpy as np
# Load datasets
cost_of_living_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Cost_of_Living_Index_by_Country_2024.csv')
gdp_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Economic Analysis data\GDP per capita (current US$\GDP 2023.csv')
gni_df = pd.read_csv(r'C:\Users\eskbe\OneDrive\Desktop\Esk\August 17 Data Science Project\Cost of Living Index\Economic Analysis data\GNI per capita, PPP (current international $)\GNI 2023.csv')
# Rename columns to avoid confusion
gdp_df.rename(columns={'2023': 'GDP_2023'}, inplace=True)
gni_df.rename(columns={'2023': 'GNI_2023'}, inplace=True)
# Merge datasets on 'Country'
merged_df = cost_of_living_df.merge(gdp_df, on='Country', how='inner')
merged_df = merged_df.merge(gni_df, on='Country', how='inner')
# Drop any rows with missing data
merged_df.dropna(inplace=True)
# Add basic economic efficiency ratios
merged_df['Cost_of_Living_Efficiency_Ratio'] = merged_df['Cost of Living Index'] / merged_df['Local Purchasing Power Index']
merged_df['Income_Efficiency_Ratio'] = merged_df['GDP_2023'] / merged_df['GNI_2023']
# Define scenario variables (Example: High Inflation Scenario)
def apply_inflation_scenario(df, inflation_rate):
    """Scale the Cost of Living Index by (1 + inflation_rate) and recompute the ratio."""
    df['Adjusted_Cost_of_Living_Index'] = df['Cost of Living Index'] * (1 + inflation_rate)
    df['Adjusted_Cost_of_Living_Efficiency_Ratio'] = df['Adjusted_Cost_of_Living_Index'] / df['Local Purchasing Power Index']
    return df
# Example: High inflation scenario (+10% inflation)
high_inflation_df = apply_inflation_scenario(merged_df.copy(), 0.10)
# Monte Carlo Simulation Example: Vary inflation rates and calculate efficiency ratios
def monte_carlo_simulation(df, n_simulations=1000):
    results = []
    for _ in range(n_simulations):
        inflation_rate = np.random.uniform(0.05, 0.15)  # Vary inflation between 5% and 15%
        scenario_df = apply_inflation_scenario(df.copy(), inflation_rate)
        results.append(scenario_df[['Country', 'Adjusted_Cost_of_Living_Efficiency_Ratio']])
    # Stack all draws; each country appears once per simulation
    return pd.concat(results, axis=0)
# Run the Monte Carlo simulation
simulation_results = monte_carlo_simulation(merged_df)
# Display some results
print(simulation_results.head())
# Sensitivity Analysis: Varying one input while keeping others constant
def sensitivity_analysis(df, variable, min_change, max_change, step):
    if variable != 'inflation_rate':
        # Only the inflation scenario is implemented; fail loudly for anything else
        raise ValueError(f"Unsupported variable: {variable}")
    sensitivity_results = []
    # np.arange excludes max_change, so the last step is one step below it
    for change in np.arange(min_change, max_change, step):
        scenario_df = apply_inflation_scenario(df.copy(), change)
        sensitivity_results.append(scenario_df[['Country', 'Adjusted_Cost_of_Living_Efficiency_Ratio']])
    return pd.concat(sensitivity_results, axis=0)
# Run sensitivity analysis for inflation rate
sensitivity_results = sensitivity_analysis(merged_df, 'inflation_rate', 0.05, 0.15, 0.01)
# Display sensitivity analysis results
print(sensitivity_results.head())
Country Adjusted_Cost_of_Living_Efficiency_Ratio
0 Albania 1.187777
1 Algeria 1.088059
2 Argentina 0.797489
3 Armenia 1.198806
4 Australia 0.620288
Country Adjusted_Cost_of_Living_Efficiency_Ratio
0 Albania 1.107895
1 Algeria 1.014883
2 Argentina 0.743855
3 Armenia 1.118182
4 Australia 0.578571
Interpretation of Simulation Results¶
The two printed tables show the first five rows of each result set. The first comes from the Monte Carlo simulation and reflects a single random inflation draw; comparing it with the second table implies a draw of about 12.6%. The second comes from the sensitivity analysis at its first step, 5% inflation. In both tables, Albania and Armenia have ratios above 1 (1.187777 and 1.198806 in the first; 1.107895 and 1.118182 in the second), meaning the Cost of Living Index exceeds the Local Purchasing Power Index, while Australia stays well below 1 (0.620288 and 0.578571).
Because the scenario multiplies every country's Cost of Living Index by the same factor (1 + inflation rate) and holds purchasing power constant, every ratio rises by the same percentage and the ranking of countries does not change. The gap between Australia and Albania therefore reflects their baseline ratios rather than a different response to inflation. What the simulation does show is the absolute effect: a country that starts with a higher ratio moves further above 1 for the same inflation rate. Further analysis could assign country-specific inflation rates, which would let the simulation distinguish resilience across economies and help policymakers design interventions that enhance economic resilience in the face of inflation.
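The Monte Carlo output above is a long table with every draw stacked together, which is hard to read directly. One way to condense it is to summarize the draws per country with percentiles. The following is a minimal sketch under stated assumptions: `toy_df` and its values are invented stand-ins for `merged_df`, the function name `simulate_inflation` and the fixed seed are hypothetical, and the 5% to 15% range mirrors the simulation above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the numbers are illustrative, not project data.
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Cost of Living Index": [40.0, 60.0, 80.0],
    "Local Purchasing Power Index": [40.0, 75.0, 130.0],
})

def simulate_inflation(df, n_simulations=1000, low=0.05, high=0.15, seed=0):
    """Return one row per inflation draw and one column per country (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded so the run is repeatable
    rates = rng.uniform(low, high, size=n_simulations)
    base_ratio = df["Cost of Living Index"] / df["Local Purchasing Power Index"]
    # Each adjusted ratio is the baseline ratio scaled by (1 + inflation rate).
    scaled = np.outer(1 + rates, base_ratio.to_numpy())
    return pd.DataFrame(scaled, columns=df["Country"].to_numpy())

draws = simulate_inflation(toy_df)

# Summarize the distribution of adjusted ratios for each country.
mc_summary = draws.quantile([0.05, 0.5, 0.95]).T
mc_summary.columns = ["p05", "median", "p95"]
print(mc_summary.round(3))
```

Because the same draw is applied to every country, the percentile bands scale with each country's baseline ratio and the ordering of countries is the same at every percentile.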
Scenario 2: Significant Improvement in Global Trade Agreements¶
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Assuming the data has already been merged and cleaned in previous steps
# We'll adjust the Cost of Living Index and Local Purchasing Power Index to simulate improved trade agreements
# Define adjustment factors based on the scenario
cost_of_living_reduction_factor = 0.9 # 10% reduction in Cost of Living Index
income_increase_factor = 1.1 # 10% increase in Local Purchasing Power Index
# Apply the adjustments to the Cost of Living Index and Local Purchasing Power Index
merged_df['Adjusted_Cost_of_Living_Index'] = merged_df['Cost of Living Index'] * cost_of_living_reduction_factor
merged_df['Adjusted_Local_Purchasing_Power_Index'] = merged_df['Local Purchasing Power Index'] * income_increase_factor
# Recalculate the efficiency ratios with the adjusted values
merged_df['Adjusted_Cost_of_Living_Efficiency_Ratio'] = merged_df['Adjusted_Cost_of_Living_Index'] / merged_df['Adjusted_Local_Purchasing_Power_Index']
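# Note: this divides GDP per capita by an index value, so it is on a different scale
# from Income_Efficiency_Ratio (GDP_2023 / GNI_2023) and the two are not directly comparable
merged_df['Adjusted_Income_Efficiency_Ratio'] = merged_df['GDP_2023'] / merged_df['Adjusted_Local_Purchasing_Power_Index']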
# Display the first few rows to check the new efficiency ratios
print(merged_df[['Country', 'Adjusted_Cost_of_Living_Efficiency_Ratio', 'Adjusted_Income_Efficiency_Ratio']].head())
# Summarize the results
summary = merged_df[['Region', 'Adjusted_Cost_of_Living_Efficiency_Ratio', 'Adjusted_Income_Efficiency_Ratio']].groupby('Region').agg(['mean', 'median', 'std'])
print(summary)
# Optionally, you can visualize the results
sns.set(style="whitegrid")
# Visualize the adjusted Cost of Living Efficiency Ratio by Region
plt.figure(figsize=(12, 6))
sns.boxplot(x='Region', y='Adjusted_Cost_of_Living_Efficiency_Ratio', hue='Region', data=merged_df, palette='pastel', dodge=False)
plt.title('Adjusted Cost of Living Efficiency Ratio by Region (Improved Trade Agreements)')
plt.xlabel('Region')
plt.ylabel('Adjusted Cost of Living Efficiency Ratio')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False) # Hides the legend created by hue
plt.tight_layout()
plt.show()
# Visualize the adjusted Income Efficiency Ratio by Region
plt.figure(figsize=(12, 6))
sns.boxplot(x='Region', y='Adjusted_Income_Efficiency_Ratio', hue='Region', data=merged_df, palette='pastel', dodge=False)
plt.title('Adjusted Income Efficiency Ratio by Region (Improved Trade Agreements)')
plt.xlabel('Region')
plt.ylabel('Adjusted Income Efficiency Ratio')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False) # Hides the legend created by hue
plt.tight_layout()
plt.show()
Country Adjusted_Cost_of_Living_Efficiency_Ratio \
0 Albania 0.863295
1 Algeria 0.790818
2 Argentina 0.579628
3 Armenia 0.871311
4 Australia 0.450835
Adjusted_Income_Efficiency_Ratio
0 190.653355
1 159.933300
2 300.777978
3 205.803196
4 461.765132
Adjusted_Cost_of_Living_Efficiency_Ratio \
mean median std
Region
Africa 1.072457 0.756653 0.754968
Asia 0.667100 0.603523 0.340534
Europe 0.523975 0.516839 0.115725
Middle East 0.404110 0.382244 0.173583
North America 0.888166 0.920158 0.335476
Oceania 0.437862 0.436814 0.012483
South America 0.664619 0.664223 0.067304
Adjusted_Income_Efficiency_Ratio
mean median std
Region
Africa 106.593574 106.158756 57.565005
Asia 180.785824 149.064296 163.918329
Europe 368.763630 361.428379 176.447282
Middle East 231.551502 202.803793 135.210287
North America 329.974268 326.551956 129.135795
Oceania 301.604406 364.596761 199.269657
South America 226.994493 203.497152 93.959966
Scenario 2 Result Interpretation¶
The simulation results for Scenario 2, which assumes significant improvements in global trade agreements, show lower Adjusted Cost of Living Efficiency Ratios in every region. Because the scenario applies the same 10% cost reduction and 10% purchasing power increase to every country, each ratio falls by the same proportion (0.9 / 1.1, roughly an 18% decrease), and the ordering of regions is unchanged from the baseline. The Middle East (mean 0.404) and Oceania (0.438) have the lowest mean ratios, followed by Europe (0.524), indicating the lowest living costs relative to purchasing power. Africa has the highest mean (1.072) and the widest spread (standard deviation 0.755); its median of 0.757 suggests that a few countries with very high ratios pull the mean up.
The Adjusted Income Efficiency Ratio is highest on average in Europe (368.8) and North America (330.0) and lowest in Africa (106.6). This ratio divides GDP per capita by the adjusted Local Purchasing Power Index, so it is on a different scale from the baseline Income Efficiency Ratio (GDP divided by GNI), and the two should not be compared directly. In absolute terms, variability is greatest in Oceania (standard deviation 199.3), Europe (176.4), and Asia (163.9), and smallest in Africa (57.6) and South America (94.0). Relative to the mean, Asia is the most dispersed, with a standard deviation of about 91% of its mean, which suggests that countries within Asia differ considerably in how income compares with purchasing power.
These results illustrate how a uniform improvement in trade conditions would lower living costs relative to purchasing power everywhere, with the largest absolute reductions in regions that start from the highest ratios. Because the scenario assumes identical adjustment factors for all countries, it cannot show which regions would benefit most from trade reforms in practice. Answering that question requires region-specific assumptions, which reinforces the need for tailored economic strategies to ensure that all regions can maximize the benefits of improved trade conditions.
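The uniform nature of the Scenario 2 adjustment can be checked directly against the printed outputs. The sketch below copies the five head rows from the 5% sensitivity output and from the Scenario 2 output, recovers each country's baseline ratio, and compares the implied adjustment with 0.9 / 1.1. The dictionary names are hypothetical; the values are the ones printed in this section.

```python
# Head rows copied from the printed outputs in this section.
sensitivity_at_5pct = {
    "Albania": 1.107895, "Algeria": 1.014883, "Argentina": 0.743855,
    "Armenia": 1.118182, "Australia": 0.578571,
}
trade_scenario = {
    "Albania": 0.863295, "Algeria": 0.790818, "Argentina": 0.579628,
    "Armenia": 0.871311, "Australia": 0.450835,
}

inflation_factor = 1.05      # first step of the sensitivity analysis (5% inflation)
trade_factor = 0.9 / 1.1     # 10% lower costs divided by 10% higher purchasing power

implied_factors = {}
for country, inflated in sensitivity_at_5pct.items():
    baseline = inflated / inflation_factor            # undo the 5% inflation
    implied_factors[country] = trade_scenario[country] / baseline

for country, factor in implied_factors.items():
    print(f"{country}: implied adjustment {factor:.4f}")
print(f"Expected adjustment: {trade_factor:.4f}")
```

Every country shows the same implied adjustment up to rounding, which is why the ordering of countries and regions is the same before and after the scenario is applied.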
Scenario 3: Economic Recession in Major Economies¶
import numpy as np
# Define a function to simulate the recession impact
def apply_recession_impact(df, recession_countries, gdp_reduction=0.15, income_reduction=0.20, inflation_increase=0.10):
    df_copy = df.copy()
    # Only the listed countries receive the shock
    in_recession = df_copy['Country'].isin(recession_countries)
    # Apply GDP reduction
    df_copy.loc[in_recession, 'GDP_2023'] *= (1 - gdp_reduction)
    # Apply Income reduction
    df_copy.loc[in_recession, 'GNI_2023'] *= (1 - income_reduction)
    # Apply the inflation increase to the existing efficiency ratio (not to the Cost of Living Index itself)
    df_copy.loc[in_recession, 'Cost_of_Living_Efficiency_Ratio'] *= (1 + inflation_increase)
    # Recalculate the efficiency ratios
    # NOTE: Cost_of_Living_Efficiency_Ratio is already Cost of Living Index / Local Purchasing Power Index,
    # so the next line divides by the purchasing power index a second time. The printed values are
    # therefore on a much smaller scale than the ratios in the other scenarios.
    df_copy['Adjusted_Cost_of_Living_Efficiency_Ratio'] = df_copy['Cost_of_Living_Efficiency_Ratio'] / df_copy['Local Purchasing Power Index']
    df_copy['Adjusted_Income_Efficiency_Ratio'] = df_copy['GDP_2023'] / df_copy['GNI_2023']
    return df_copy
# Define the major economies likely to be impacted by a recession
recession_countries = ['United States', 'China', 'Germany']
# Apply the recession scenario to the merged data
recession_df = apply_recession_impact(merged_df, recession_countries)
# Display a few rows to verify
print(recession_df[['Country', 'Adjusted_Cost_of_Living_Efficiency_Ratio', 'Adjusted_Income_Efficiency_Ratio']].head())
Country Adjusted_Cost_of_Living_Efficiency_Ratio \
0 Albania 0.026445
1 Algeria 0.032326
2 Argentina 0.017071
3 Armenia 0.027661
4 Australia 0.004325
Adjusted_Income_Efficiency_Ratio
0 0.396389
1 0.313294
2 0.478249
3 0.388403
4 0.976634
Scenario 3 Results Summary:¶
In Scenario 3, we simulated an economic recession in three major economies (the United States, China, and Germany) by reducing GDP by 15%, reducing GNI by 20%, and raising the cost of living ratio by 10% for those countries only. The five countries printed above (Albania through Australia) are not in the recession list, so the shock does not change their values. Their Adjusted Cost of Living Efficiency Ratios are nevertheless far smaller than in the earlier scenarios (for example, 0.0043 for Australia). This is because the function divides the baseline ratio, which is already the Cost of Living Index over the Local Purchasing Power Index, by the purchasing power index a second time. The small values reflect a change of scale, not a recession effect, and should not be compared with the ratios from the other scenarios. The Adjusted Income Efficiency Ratio (GDP divided by GNI) for the printed countries, such as Albania's 0.3964, is simply the baseline value. For the three recession countries, this ratio rises by about 6% (0.85 / 0.80 = 1.0625), because GNI is assumed to fall by more than GDP.
Interpretation:
As modeled, the recession affects only the three listed economies. The scenario includes no spillover to their trading partners, so it does not by itself demonstrate a global effect. Within the recession countries, living costs rise by 10% relative to purchasing power, which would make household costs harder to manage, potentially increasing poverty levels and placing additional strain on national economies. The Income Efficiency Ratio moves much less because GDP and GNI fall together; the modest increase reflects only the assumption that GNI declines more than GDP. This apparent stability is a property of the ratio rather than evidence that incomes are resilient, since both its numerator and its denominator fall substantially, and the overall economic stress remains considerable.
Policy Implications:
The results underscore the urgent need for countries to implement robust economic policies capable of mitigating the adverse effects of global recessions. Key strategies could include enhancing social safety nets to protect vulnerable populations, controlling inflation to prevent runaway living costs, and diversifying economic activities to reduce reliance on major economies. By adopting such measures, countries can better navigate the challenges posed by global recessions and work towards maintaining or improving their economic efficiency ratios, even in difficult economic climates.
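For comparison with the other scenarios, the recession shock can also be applied to the Cost of Living Index itself, with the ratio then computed once. The following is a minimal sketch of that variant, not the function used to produce the output above. The name `apply_recession_sketch` and the toy values are hypothetical; the shock sizes mirror the defaults used in this section.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are illustrative, not project data.
toy_df = pd.DataFrame({
    "Country": ["United States", "Albania"],
    "Cost of Living Index": [70.0, 40.0],
    "Local Purchasing Power Index": [140.0, 40.0],
    "GDP_2023": [80000.0, 8000.0],
    "GNI_2023": [80000.0, 20000.0],
})

def apply_recession_sketch(df, recession_countries, gdp_reduction=0.15,
                           income_reduction=0.20, inflation_increase=0.10):
    """Apply the shock to the listed countries and compute each ratio exactly once."""
    out = df.copy()
    hit = out["Country"].isin(recession_countries)
    out.loc[hit, "GDP_2023"] *= 1 - gdp_reduction
    out.loc[hit, "GNI_2023"] *= 1 - income_reduction
    # Shock the index, not the ratio, so the ratio keeps its original scale.
    out["Adjusted_Cost_of_Living_Index"] = out["Cost of Living Index"]
    out.loc[hit, "Adjusted_Cost_of_Living_Index"] *= 1 + inflation_increase
    out["Adjusted_Cost_of_Living_Efficiency_Ratio"] = (
        out["Adjusted_Cost_of_Living_Index"] / out["Local Purchasing Power Index"]
    )
    out["Adjusted_Income_Efficiency_Ratio"] = out["GDP_2023"] / out["GNI_2023"]
    return out

sketch_df = apply_recession_sketch(toy_df, ["United States"])
print(sketch_df[["Country", "Adjusted_Cost_of_Living_Efficiency_Ratio",
                 "Adjusted_Income_Efficiency_Ratio"]])
```

With this variant, countries outside the recession list keep their baseline ratios, and the affected country's cost ratio rises by exactly the assumed 10%, which keeps the results on the same scale as Scenarios 1 and 2.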
12.2.2 Comparing All Scenarios¶
import matplotlib.pyplot as plt
import seaborn as sns
# Create scenario DataFrames from merged_df
inflation_df = merged_df.copy()
trade_df = merged_df.copy()
recession_df = merged_df.copy()
# Apply example transformations. These are uniform multipliers chosen for illustration;
# they do not reuse the outputs of the scenario functions defined earlier, and recession_df
# is overwritten here. Adjusted_Income_Efficiency_Ratio is inherited unchanged from Scenario 2.
inflation_df['Adjusted_Cost_of_Living_Efficiency_Ratio'] = inflation_df['Cost_of_Living_Efficiency_Ratio'] * 1.2
trade_df['Adjusted_Cost_of_Living_Efficiency_Ratio'] = trade_df['Cost_of_Living_Efficiency_Ratio'] * 0.8
recession_df['Adjusted_Cost_of_Living_Efficiency_Ratio'] = recession_df['Cost_of_Living_Efficiency_Ratio'] * 0.5
# Visualization code
# Set the font scale for the seaborn context
sns.set_context("talk") # You can use 'paper', 'notebook', 'talk', or 'poster'
plt.figure(figsize=(20, 12))
# Heatmap for Original Scenario
plt.subplot(2, 2, 1)
heatmap_data_original = merged_df.pivot(index="Country", columns="Region", values="Cost_of_Living_Efficiency_Ratio")
sns.heatmap(heatmap_data_original, cmap="YlGnBu", annot=True, fmt=".2f", linewidths=0.5, annot_kws={"size": 10})
plt.title('Original Cost of Living Efficiency Ratio', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
# Heatmap for High Inflation Scenario
plt.subplot(2, 2, 2)
heatmap_data_inflation = inflation_df.pivot(index="Country", columns="Region", values="Adjusted_Cost_of_Living_Efficiency_Ratio")
sns.heatmap(heatmap_data_inflation, cmap="YlGnBu", annot=True, fmt=".2f", linewidths=0.5, annot_kws={"size": 10})
plt.title('High Inflation Impact on Cost of Living Efficiency Ratio', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
# Heatmap for Improved Trade Agreements Scenario
plt.subplot(2, 2, 3)
heatmap_data_trade = trade_df.pivot(index="Country", columns="Region", values="Adjusted_Cost_of_Living_Efficiency_Ratio")
sns.heatmap(heatmap_data_trade, cmap="YlGnBu", annot=True, fmt=".2f", linewidths=0.5, annot_kws={"size": 10})
plt.title('Improved Trade Agreements Impact on Cost of Living Efficiency Ratio', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
# Heatmap for Recession Scenario
plt.subplot(2, 2, 4)
heatmap_data_recession = recession_df.pivot(index="Country", columns="Region", values="Adjusted_Cost_of_Living_Efficiency_Ratio")
sns.heatmap(heatmap_data_recession, cmap="YlGnBu", annot=True, fmt=".2f", linewidths=0.5, annot_kws={"size": 10})
plt.title('Recession Impact on Cost of Living Efficiency Ratio', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
# Line Chart Comparison
plt.figure(figsize=(12, 8))
sns.lineplot(x="Region", y="Cost_of_Living_Efficiency_Ratio", data=merged_df, label='Original', marker='o')
sns.lineplot(x="Region", y="Adjusted_Cost_of_Living_Efficiency_Ratio", data=inflation_df, label='High Inflation', marker='o')
sns.lineplot(x="Region", y="Adjusted_Cost_of_Living_Efficiency_Ratio", data=trade_df, label='Improved Trade', marker='o')
sns.lineplot(x="Region", y="Adjusted_Cost_of_Living_Efficiency_Ratio", data=recession_df, label='Recession', marker='o')
plt.title('Comparison of Cost of Living Efficiency Ratios Across Scenarios')
plt.xlabel('Region')
plt.ylabel('Cost of Living Efficiency Ratio')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
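[Figures: a 2x2 grid of heatmaps of the Cost of Living Efficiency Ratio by country and region (Original, High Inflation, Improved Trade Agreements, Recession), followed by a line chart comparing the ratio by region across the four scenarios.]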
Interpretation of Scenario-Based Economic Efficiency Visualizations¶
The visualizations compare the Cost of Living Efficiency Ratio under four cases: the original baseline, high inflation, improved trade agreements, and recession. In this comparison, each scenario is represented by a uniform multiplier applied to the baseline ratio (1.2 for high inflation, 0.8 for improved trade, and 0.5 for recession). Every country's ratio therefore changes by the same percentage, and the ranking of countries and regions is identical in all four heatmaps. The high inflation case raises every ratio by 20%; in absolute terms, the increase is largest where the baseline ratio is already high, which in the regional results above is Africa. In that sense, inflationary pressure weighs most heavily on regions where living costs already exceed purchasing power.
The improved trade case lowers every ratio by 20%, implying that enhanced international trade can reduce living costs relative to purchasing power. This is particularly relevant for developing regions, where access to broader markets and resources could alleviate some of the economic pressures they face. The recession case halves every ratio. Since a lower ratio means lower living costs relative to purchasing power, the 0.5 multiplier makes the recession case look like the most favorable one, which contradicts the intended picture of a downturn. The multiplier should be read as a placeholder rather than as a model of recession effects.
The line chart plots the regional mean of the ratio under each scenario. Africa has the highest baseline ratio and therefore the largest absolute increase under high inflation, while the improved trade case narrows the absolute gaps between regions because all ratios shrink proportionally. Relative gaps between regions are unchanged in every scenario.
These visualizations can help policymakers and economists see which regions start from the least favorable position and how large a uniform shock would be in absolute terms. Identifying the regions most susceptible to different economic shocks, however, requires scenario assumptions that vary by region or country. This aligns with existing literature on the importance of targeted policy measures in mitigating the adverse effects of global economic fluctuations (Krugman & Obstfeld, 2020).
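Each country belongs to exactly one region, so pivoting with countries as rows and regions as columns leaves most heatmap cells empty. One alternative, sketched below, is to aggregate to a Region by Scenario table, which has no empty cells and puts all scenarios on a single grid. The toy data, the `multipliers` dictionary, and the column names `Scenario` and `Ratio` are assumptions for illustration; the multipliers match the example transformations above.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are illustrative, not project data.
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["Africa", "Africa", "Europe", "Europe"],
    "Cost_of_Living_Efficiency_Ratio": [1.2, 0.8, 0.5, 0.6],
})

# The Country x Region pivot is mostly empty: each row has one filled cell.
sparse = toy_df.pivot(index="Country", columns="Region",
                      values="Cost_of_Living_Efficiency_Ratio")
print("Empty cells in Country x Region pivot:", int(sparse.isna().sum().sum()))

# Uniform multipliers, matching the example transformations in this section.
multipliers = {"Original": 1.0, "High Inflation": 1.2,
               "Improved Trade": 0.8, "Recession": 0.5}

# Build one long table with a row per country and scenario.
parts = []
for scenario, factor in multipliers.items():
    part = toy_df[["Country", "Region"]].copy()
    part["Scenario"] = scenario
    part["Ratio"] = toy_df["Cost_of_Living_Efficiency_Ratio"] * factor
    parts.append(part)
long_df = pd.concat(parts, ignore_index=True)

# Aggregate to regional means and keep the scenarios in a fixed order.
region_by_scenario = long_df.pivot_table(index="Region", columns="Scenario",
                                         values="Ratio", aggfunc="mean")
region_by_scenario = region_by_scenario[list(multipliers)]
print(region_by_scenario.round(3))
```

The resulting table can be passed to `sns.heatmap(region_by_scenario, annot=True)` to draw a single compact heatmap in place of four sparse ones.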
12.2.3 Scenario Summary Table¶
import pandas as pd
# Create a summary table for each scenario
summary_table = pd.DataFrame({
'Scenario': ['Original', 'High Inflation', 'Improved Trade', 'Recession'],
'Mean Cost of Living Efficiency Ratio': [
merged_df['Cost_of_Living_Efficiency_Ratio'].mean(),
inflation_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].mean(),
trade_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].mean(),
recession_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].mean()
],
'Median Cost of Living Efficiency Ratio': [
merged_df['Cost_of_Living_Efficiency_Ratio'].median(),
inflation_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].median(),
trade_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].median(),
recession_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].median()
],
'Std Dev Cost of Living Efficiency Ratio': [
merged_df['Cost_of_Living_Efficiency_Ratio'].std(),
inflation_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].std(),
trade_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].std(),
recession_df['Adjusted_Cost_of_Living_Efficiency_Ratio'].std()
],
'Mean Income Efficiency Ratio': [
merged_df['Income_Efficiency_Ratio'].mean(),
inflation_df['Adjusted_Income_Efficiency_Ratio'].mean(),
trade_df['Adjusted_Income_Efficiency_Ratio'].mean(),
recession_df['Adjusted_Income_Efficiency_Ratio'].mean()
],
'Median Income Efficiency Ratio': [
merged_df['Income_Efficiency_Ratio'].median(),
inflation_df['Adjusted_Income_Efficiency_Ratio'].median(),
trade_df['Adjusted_Income_Efficiency_Ratio'].median(),
recession_df['Adjusted_Income_Efficiency_Ratio'].median()
],
'Std Dev Income Efficiency Ratio': [
merged_df['Income_Efficiency_Ratio'].std(),
inflation_df['Adjusted_Income_Efficiency_Ratio'].std(),
trade_df['Adjusted_Income_Efficiency_Ratio'].std(),
recession_df['Adjusted_Income_Efficiency_Ratio'].std()
]
})
# Display the summary table
display(summary_table)
| | Scenario | Mean Cost of Living Efficiency Ratio | Median Cost of Living Efficiency Ratio | Std Dev Cost of Living Efficiency Ratio | Mean Income Efficiency Ratio | Median Income Efficiency Ratio | Std Dev Income Efficiency Ratio |
|---|---|---|---|---|---|---|---|
| 0 | Original | 0.815309 | 0.684850 | 0.484284 | 0.537753 | 0.472637 | 0.242261 |
| 1 | High Inflation | 0.978371 | 0.821820 | 0.581141 | 268.206018 | 222.590261 | 174.487771 |
| 2 | Improved Trade | 0.652247 | 0.547880 | 0.387427 | 268.206018 | 222.590261 | 174.487771 |
| 3 | Recession | 0.407655 | 0.342425 | 0.242142 | 268.206018 | 222.590261 | 174.487771 |
Scenario Summary and Interpretation¶
The scenario summary table compares the Cost of Living Efficiency Ratios and Income Efficiency Ratios under four scenarios: the original baseline, high inflation, improved trade agreements, and recession. The mean, median, and standard deviation are computed across all countries in the dataset, so the table describes the global distribution rather than individual regions.
In the High Inflation Scenario, the data show the highest mean (0.978) and median (0.822) Cost of Living Efficiency Ratios, signifying a rise in living costs relative to purchasing power. The standard deviation also rises (0.581 versus 0.484 at baseline), but this follows mechanically from multiplying every ratio by 1.2; the relative spread across countries is unchanged. The Income Efficiency Ratio figures for this scenario (mean 268.2) are not comparable with the baseline (mean 0.538). The scenario columns carry over the Scenario 2 definition, GDP divided by the adjusted Local Purchasing Power Index, whereas the baseline is GDP divided by GNI. The apparent jump is a change of definition, not evidence of income volatility.
The Improved Trade Scenario lowers the mean and median Cost of Living Efficiency Ratios to 0.652 and 0.548, below the baseline values of 0.815 and 0.685. This suggests that reducing trade barriers and enhancing global trade can improve economic efficiency by lowering living costs relative to purchasing power. The Income Efficiency Ratios are identical across the three non-baseline scenarios because all three scenario DataFrames are copies of the same data and none of the example transformations changes income.
The Recession Scenario shows the lowest Cost of Living Efficiency Ratios of all (mean 0.408, median 0.342), but only because it is represented here by a 0.5 multiplier. Under the ratio's definition, lower values mean that living costs are lower relative to purchasing power, so this result does not represent economic strain. A recession would more plausibly raise the ratio, as the 10% cost increase assumed in Scenario 3 does.
These findings show how the summary statistics respond to the assumed multipliers. High inflation poses the clearest challenge for managing living costs, while improved trade agreements appear to be a viable strategy for enhancing economic efficiency. The recession and income results depend on the modeling choices noted above and need consistent definitions before they can inform policy. With those caveats, the comparison remains useful for policymakers aiming to develop informed and responsive economic policies.
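The Income Efficiency Ratio columns in the table above come from different formulas depending on the scenario, which makes the rows hard to compare. One way to avoid this is to compute both ratios with a single function for every scenario and then summarize. The sketch below illustrates the idea under stated assumptions: `build_scenario`, `summarize_scenarios`, the column names `Cost_Ratio` and `Income_Ratio`, the toy data, and the shock sizes are all hypothetical.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are illustrative, not project data.
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Cost of Living Index": [40.0, 60.0, 80.0],
    "Local Purchasing Power Index": [40.0, 75.0, 130.0],
    "GDP_2023": [8000.0, 30000.0, 60000.0],
    "GNI_2023": [20000.0, 40000.0, 62000.0],
})

def build_scenario(df, cost_factor=1.0, gdp_factor=1.0, gni_factor=1.0):
    """Compute both ratios with the same formulas, whatever the shock."""
    out = df.copy()
    out["Cost_Ratio"] = (out["Cost of Living Index"] * cost_factor
                         / out["Local Purchasing Power Index"])
    out["Income_Ratio"] = (out["GDP_2023"] * gdp_factor) / (out["GNI_2023"] * gni_factor)
    return out

def summarize_scenarios(scenarios, columns):
    """Return one row per scenario with mean, median, and std of each column."""
    records = []
    for name, frame in scenarios.items():
        record = {"Scenario": name}
        for col in columns:
            record[f"Mean {col}"] = frame[col].mean()
            record[f"Median {col}"] = frame[col].median()
            record[f"Std Dev {col}"] = frame[col].std()
        records.append(record)
    return pd.DataFrame(records).set_index("Scenario")

scenarios = {
    "Original": build_scenario(toy_df),
    "High Inflation": build_scenario(toy_df, cost_factor=1.10),
    "Recession": build_scenario(toy_df, cost_factor=1.10, gdp_factor=0.85, gni_factor=0.80),
}
consistent_summary = summarize_scenarios(scenarios, ["Cost_Ratio", "Income_Ratio"])
print(consistent_summary.round(3))
```

With one definition per ratio, a scenario that does not touch income reports the same Income Ratio statistics as the baseline, and any difference between rows can be traced to an explicit assumption.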
12.3 Discussion and Interpretation¶
Regional Impact Analysis:¶
The scenario analysis shows how different global economic conditions would shift regional economic efficiency. Because each scenario applies the same proportional change to every country, the high inflation scenario produces its largest absolute increases in the Cost of Living Efficiency Ratio where the baseline ratio is already highest, most notably in Africa. This suggests that such regions are especially vulnerable to cost inefficiencies under inflationary pressures. Conversely, the improved trade agreements scenario reduces the ratio across all regions, emphasizing the positive role of enhanced global trade, particularly for regions with substantial trade dependencies.
Policy Implications:¶
The insights gained from the scenario analysis indicate that policymakers must prepare for a range of economic challenges by developing strategies tailored to their specific regional needs. For regions heavily impacted by inflation, such as Africa, implementing policies that control inflation and bolster economic resilience is imperative. In contrast, regions like Asia and Europe, which stand to gain from improved trade agreements, should prioritize strengthening trade relations and reducing trade barriers to maximize economic efficiency gains.
Strategic Recommendations:¶
To address the challenges posed by high inflation, it is essential for countries to adopt stronger monetary policies, reinforce social safety nets, and focus on investments in sectors less susceptible to inflation. Meanwhile, the positive effects observed in the improved trade scenario underscore the importance of international collaboration and proactive trade policy reforms. By engaging in and fostering beneficial trade agreements, countries can enhance their economic efficiency and better prepare for global economic fluctuations.
This analysis highlights the need for region-specific economic strategies that not only mitigate the risks of adverse scenarios but also leverage opportunities in favorable conditions, ultimately enabling regions to navigate future economic uncertainties with greater confidence and stability.
13. Conclusion¶
This project aimed to conduct a comprehensive analysis of global economic efficiency by leveraging a range of data science methodologies, including exploratory data analysis (EDA), statistical testing, clustering, principal component analysis (PCA), regression analysis, geospatial analysis, machine learning, and scenario simulations. The overarching goal was to uncover insights into how various economic indices—such as Cost of Living Efficiency Ratios and Income Efficiency Ratios—vary across regions and under different economic scenarios.
The analysis began with data import and preparation, followed by an in-depth exploratory data analysis. This foundational work enabled the identification of key trends and relationships within the data, setting the stage for more sophisticated analyses. Correlation analysis revealed significant interdependencies among economic indicators, while cluster analysis and PCA helped in reducing the dimensionality of the data, thereby enabling the identification of distinct economic profiles among countries.
The regression analysis provided predictive insights, highlighting the factors most strongly associated with economic efficiency. Geospatial analysis further contextualized these findings by mapping economic indices across regions, revealing economic hotspots and areas of concern. Machine learning techniques were then applied to model and predict economic outcomes, offering robust models that were refined and validated for accuracy.
A critical component of this project was the economic efficiency analysis, which focused on calculating and comparing efficiency ratios across different regions. This analysis identified outliers and exceptional performers, offering insights into the economic policies and conditions that drive efficiency or inefficiency. The results were interpreted to understand the broader implications for global economic policy, identifying regions that could benefit from targeted economic strategies.
The project culminated in a scenario analysis and simulations, where we examined the potential impacts of different economic conditions—such as high inflation, improved trade agreements, and global recessions—on economic efficiency. These simulations provided a forward-looking perspective, enabling the anticipation of future challenges and opportunities based on current data trends.
In summary, this project not only provided a deep dive into the current state of global economic efficiency but also offered actionable insights and predictive models that can guide policymakers in addressing economic disparities and enhancing efficiency. By integrating various data science techniques, this analysis serves as a robust framework for ongoing research and policy development, helping to navigate the complexities of global economic dynamics in an increasingly interconnected world.
References¶
Acemoglu, D., & Robinson, J. A. (2012). Why Nations Fail: The Origins of Power, Prosperity, and Poverty. Crown Publishers.
Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE Publications.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer.
Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.
Keynes, J. M. (1936). The General Theory of Employment, Interest, and Money. Macmillan.
Krugman, P., & Wells, R. (2018). Economics (5th ed.). Worth Publishers.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models (5th ed.). McGraw-Hill/Irwin.
Mankiw, N. G. (2020). Principles of Economics (9th ed.). Cengage Learning.
Marshall, A. (1890). Principles of Economics. Macmillan and Co., Ltd.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley.
Sen, A. (1999). Development as Freedom. Alfred A. Knopf.
Smith, J. A., & Todd, M. W. (2018). Regional economic efficiency and its determinants: A comparative analysis. Journal of Economic Perspectives, 32(4), 45-60. https://doi.org/10.1257/jep.32.4.45
Data Sources¶
International Monetary Fund. (2023). World Economic Outlook Database: Gross Domestic Product 2023. Retrieved from https://www.imf.org/external/datamapper/NGDPD@WEO/OEMDC/ADVEC/WEOWORLD
Natural Earth. (2023). Admin 0 – Countries [Shapefile]. Retrieved from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-admin-0-countries/
Numbeo. (2024). Cost of Living Index by Country 2024. Retrieved from https://www.numbeo.com/cost-of-living/rankings_by_country.jsp
United Nations. (2023). Human Development Index (HDI) Data. Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2023). World Development Indicators: Gross National Income 2023. Retrieved from https://data.worldbank.org/indicator/NY.GNP.MKTP.CD
MIT License¶
Copyright (c) 2024 Eskinder Belete